Automating Machine Learning and Deep Learning Workflows



Mourad Mourafiq discusses automating ML workflows with the help of Polyaxon, an open source platform built on Kubernetes, to make machine learning reproducible, scalable, and portable.


Mourad Mourafiq is an engineer with more than 8 years of experience. He has been working in different roles involving quantitative trading, data analytics, software engineering, and team-leading at EIB, BNP Paribas, Seerene, Kayak, Dubsmash. He is currently working on a new open source platform for building, training, and monitoring large scale deep learning applications called Polyaxon.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Mourafiq: This talk is going to be about how to automate machine learning and deep learning workflows and processes. Before I start, I will talk a bit about myself. My name is Mourad [Mourafiq], I have a background in computer science and applied mathematics. I've been working in the tech industry and the banking industry for the last eight years, in different roles involving mathematical modeling, software engineering, data analytics, and data science. For the last two years, I've been working on a platform to automate and manage the whole life cycle of machine learning and model management, called Polyaxon.

Since I will be talking about a lot of processes, best practices, and ideas to streamline your model management work, I'll be referring a lot to Polyaxon as an example of a tool for doing these data science workflows. Several approaches and solutions are based on my own experience developing this tool, and on talking with customers and community users, since the platform is open source.

What is Polyaxon?

What is Polyaxon? Polyaxon is a platform that tries to solve the machine learning life cycle. Basically, it tries to automate as much as possible so that you can iterate as fast as possible on your model production and model deployments. It has no lock-in: you can deploy it on premise or on any cloud platform. It's open source. I believe that the future of machine learning will be based on open source initiatives. We already saw how the Python scientific libraries had a huge impact on the tech industry and also on other industries. It's agnostic to the languages, frameworks, and libraries that you are using for creating models, so it works with pretty much all major deep learning and machine learning frameworks.

It can be used by solo researchers and it scales to different large teams and large organizations. Basically, it has built-in features for compliance, auditing, and security. Obviously, when we talk about new platforms for machine learning, a lot of people are still skeptical about, "Why do we need new tools to manage machine learning operations?" A lot of people would ask, "Why can't we use the tools that we already know and already love to automate data science workflows?"

Although I don't agree, I think I understand pretty much why people ask these kinds of questions. I think the software industry has matured a lot in the last couple of decades. We've produced a lot of tools and software to improve the quality of software engineers' work, a lot of tools for reviewing, sharing processes, and also sharing knowledge, but I don't think that these tools can be used for machine learning. In fact, I think software engineering has matured so much that when we use, for example, words from other engineering disciplines, such as civil engineering words like infrastructure or platform, we feel that it makes sense. For machine learning, I think it's quite different, and to understand that, we need to ask ourselves two questions. The first is, what is the difference between traditional software development and machine learning development? The second is, what is the difference between software deployments and machine learning deployments?

The Difference between Software Development and ML Development

The first big question is, what is the difference between software development and machine learning development? There are three key aspects to this difference. The first one is, what do we need to develop when we're doing traditional software, and what are the objectives? In general, we have some specifications: a manager comes with a specification, and engineers try to write code to answer all the aspects of that specification. If you are developing a form or an API, you already have an idea of where you want to get to.

In machine learning, however, it's quite different because we don't have specifications. We just try to optimize some metrics, whether you want to increase the conversion rate, improve the CTR, increase the engagement in your app, or increase the time people spend consuming your feed; that's the most important thing that you want to do, and you don't have a very specific way to describe it. You need to optimize your current metrics as much as possible to have an impact on your business.

The second aspect is how do we vet and assess the quality of software or machine learning models? I think it's also very different. In software engineering, we developed a lot of metrics; we developed a lot of tools to do reviewing. We have metrics about complexity, lines of code, number of functions in a file or in a class, how many flows we need so that we can understand a piece of software in an easy way, and then we can have the green light to deploy it.

In machine learning, you might write the best piece of code based on TensorFlow or scikit-learn or PyTorch, but the outcome can still be invalid because there is another aspect to it, which is data. The way you look at the data is not objective; it's very subjective. It depends on the person who's looking at the data and developing an intuition about it. You need some industry insights. You need to think about the distribution, and if there's some bias, you need to remove it. The way you make sure that the quality of your model is good is also different.

Finally, the tools that we use for traditional software development and machine learning development are different. In traditional software development, when you think about companies, you can even say that "this company is a Java shop, or a C++ shop, or a Python shop." We think about companies in terms of the language or the framework they use most. A lot of people ask, "What are the companies using Rails?" Other people would say, "GitHub, GitLab." "Who is using Django?" They would say, for example, "Pinterest" or "Instagram." Recently, there was a post on Hacker News about how Netflix is using Python for data science, and one of the commenters was really surprised that they are using Python because he thought that it was a Java shop.

For data science, you don't think about frameworks. Data scientists will probably use different types of frameworks and libraries. They just want to get the job done. If you can derive insights using Excel, you should use Excel. It doesn't matter what type of tools you use to have an impact on your business. So this aspect is also different.

Deployments - I don't think that there is a big difference between traditional software deployments and machine learning deployments, but I think machine learning deployments are much more complex, because they have another aspect, which is the feedback loop. When you develop software and you deploy it, you can even leave it on autopilot. If you have some sprints and you want to do some refinements, you might develop, for example, a form, and then if you missed a validation, in the next sprint you can add this validation and deploy it, and everything should be fine. The people who are involved are the software engineer, maybe some QA, and then the DevOps.

Machine learning is different, because first of all, you cannot deploy in autopilot mode. The model will get stale, the performance will start decreasing, and you will have some new data that you need to feed to the model to increase its performance. Iteration is also different. Say you also have sprints and you did some kind of experimentation; you reached some good results and you want to deploy them, but you still have a lot of ideas and a lot of configurations that you want to explore. The people who are involved in these refinements are completely different, because maybe you will ask some data engineer to be involved in doing some kind of cleaning, augmentation, or feature engineering before you can even start the experimentation process. Then there's, again, the DevOps to deploy.

I think that there are major differences between normal, standard software development and machine learning development, which means that we need to think about new tooling to help data scientists and many other types of employees who are involved in the machine learning life cycle to be more productive.

What a ML Platform Should Answer

This is how I think about it. If you are thinking about building something in-house or adopting a tool, whether it's open source or paid, you need to think about how this tool can be flexible enough to support open source initiatives. We all agree that every two weeks, there's a new tool for time series, anomaly detection, or some kind of new classifier that your data scientists should be able to use in an easy way. If you are developing or thinking about developing a tool, it should allow this kind of support for new initiatives. It should also provide different types of deployments. When you think about deployments, you also think about how you are distributing your models, whether for internal usage or for consumers who are going to use an API call. So, you need to think about different deployments for the model.

Ideally, I think it should be open source; I believe that open source is the future of machine learning. We all need to think about giving back to the open source community and try to have some specifications or standards emerge so that we can mature this space as fast as possible. You should allow your data scientists and the other machine learning engineers or data analysts who are going to interact with the platform to integrate and use their tools. You just need to provide them with some augmentation on the tooling that they are using right now. It should scale with users, and by that I mean not only the human factor, but also the computational factor: providing access to, for example, a larger cluster to do distributed learning or hyperparameter tuning. Finally, you need to always have an idea about how you can incorporate compliance, auditing, and security, and we'll talk about that in a bit. To summarize, these are all the questions that a machine learning platform should answer.

ML Development Lifecycle

This is what, at least from the feedback that I got from a lot of people, the development or model management life cycle should look like. You first need to start by accessing the data; this is the first step. If you don't have data, you just have traditional software, so you need to get some data to start doing prediction and getting insights. Once you have access to the data, you need to provide different types of user access to create features, to do exploration, to refine the data, to do augmentation, to do cleaning, and many other kinds of things on top of the data.

Once you have access to the data and the features, you can start the iterative process of experimentation. When we talk about the experimentation process, you need to think also about how you can go from one environment to another, how you can onboard new users or new data scientists in your company, and how you can do risk management if someone is leaving. The best way is packaging. You need to think about the packaging format so that you can have reusability, portability, and reproducibility of the experimentation process. Doing the experimentation process by hand could be easy, but then you start thinking about scaling. As I said, you can scale the number of employees working on a specific project, but you can also just scale what one user can do by providing robust scheduling, robust orchestration, and hyperparameter tuning and optimization.

When you scale the experimentation process, you will generate a lot of records, a lot of experiments, and you need to start thinking about how you can get the best experiments and how you can go from these experiments to models to deploy. It's very hard because one user can generate thousands or hundreds of thousands of experiments. Now, you need to think about how you can track what those experiments generate in terms of metrics, artifacts, parameters, and configurations, what data went into each experiment, and how you can easily get to the best-performing experiments.

This is also very important when you provide an easy way to do tracking; you will have auto documentation. You will not rely on other people creating documentation for your experiments, because you will have a knowledge center; you will have an easy way to also distribute the knowledge between your employees. Managers can also have a very good idea about when, for example, a model is good enough that you can expect it in two weeks, and then communicate that with other teams, for example, marketing or business, so that we can start a campaign about the new feature.

Finally, when you have all these aspects solved - you have a lot of experiments, you have decent experiments that you want to try out in production - you need to start thinking about a new type of packaging, which is packaging for the model, and it's different from the packaging for the experiments. We will go back to all these aspects, but this is just to give you an overall idea. Then you can deploy and distribute the models to your users.

There's not only one data scientist behind the computer creating models; there are a lot of types of employees involved in this whole process. Some of them are DevOps, some of them are managers, and they need to have an overview. For example, if there is a new regulation and your data has some problems with this regulation, you need to know which experiments used which data and which models deployed right now are using this data, so that you can take them down, upgrade them, or change them. You might also, in your packaging, have some requirements or dependencies on packages that have security issues, and you need to know exactly how you can upgrade or take down those models.

Finally, if you went through all these steps, you already have models and they're probably having a good impact on your business; you need to refine and automate all these processes, going from one step to another. This is done by listening to events, for example, new data coming in on buckets, or automatic upgrades of the minor version of a package that you are using for deploying the models. You need to think about a workflow that can create different types of pipelines, to go from caching all the features that you created in the second step, to creating a hyperparameter tuning group, taking, for example, the top five experiments, deploying them, running A/B testing on these experiments, keeping two, and doing some ensembling over these two experiments.

We will be talking about all these aspects one by one, and we'll start with the data access. As a data scientist, I believe that you need access to a lot of data coming from a variety of backends. Basically, you need to allow your data scientists and data engineers to access data coming from Hadoop, from SQL, from other cloud storages.

When you provide data to users, you also need to have some kind of governance. You need to know who can access this data. Take credit card data: probably not everyone in the company can have access to it, but a team or some users can. You need to also think about the processes for encrypting or obfuscating that data, so that other people who are going to intervene later on can access this data in a very simple way.

When you do have access to the data, you can start thinking about how you can refine this data, develop some kind of intuition about it, and develop features. When you want to create these features and build this intuition about the data, you need to have plug-ins and some scheduling so that you can allow the users, data analysts or data engineers, to use notebooks and internal dashboards, and also to create jobs that can run for days to create all these features. You need to have some kind of catalog. In doing that, you need to think about caching all these steps, because if you have multiple employees who need access to the same features, they shouldn't need to run the job on the same data twice; it would just be a waste of computation and time. You need to think about cataloging both the data and the features.
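The caching idea can be sketched in a few lines. This is only an illustration, not how Polyaxon implements it: `cache_key`, `cached_feature_job`, and the `data_version` label are hypothetical names, and the cache is a local directory of pickled results keyed by a hash of the job name, its parameters, and the data version.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def cache_key(job_name: str, params: dict, data_version: str) -> str:
    """Derive a stable key from the job name, its parameters, and the data version."""
    payload = json.dumps(
        {"job": job_name, "params": params, "data": data_version}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_feature_job(cache_dir: Path, job_name: str, params: dict,
                       data_version: str, compute):
    """Return (result, cache_hit). Only runs `compute` if no identical job ran before."""
    path = cache_dir / f"{cache_key(job_name, params, data_version)}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes()), True  # reuse the stored features
    result = compute(**params)
    path.write_bytes(pickle.dumps(result))
    return result, False


if __name__ == "__main__":
    def rolling_mean(window):
        # Stand-in for an expensive feature job (hypothetical toy data).
        values = [1.0, 2.0, 3.0, 4.0]
        return [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        for _ in range(2):
            features, hit = cached_feature_job(
                cache, "rolling_mean", {"window": 2}, "v1", rolling_mean)
            print(features, "cache hit" if hit else "computed")
```

Because the data version is part of the key, a second user asking for the same features on the same data gets the stored result, while new data forces a recomputation.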

Now that we have the data and the features already prepared, we can start the experimentation process, which is an iterative process. You might start by working on some local version of your experiments to develop an intuition or a benchmark, but you might also want to use some computational resources, like TPUs or GPUs. We all know that when someone starts doing this experimentation, they start installing packages, pip install this, pip install that, and then after a couple of days you're asking someone else to run the experiments and they find themselves unable to even get the environment running. I think this is a big risk. You need to think about the packaging format of the experiments so that you can have portability and reusability of these artifacts.

When you start the experimentation, whether it's in a local environment or on a cluster, users in general have different kinds of tooling, and you need to allow them to use all this tooling. I think one of the easiest ways to do that is taking advantage of containers. Even with the most organized people, who might have, for example, a Dockerfile, it's always very hard for other people to use those Dockerfiles, or even requirements files or conda environments. It's always super hard.

For example, in Polyaxon, we have this very simple packaging format. It gives you an idea about how you can, for example, use a base image and what you want to install in this image. Thinking about user experience is super important when developing this, also in ad hoc teams. The packaging should be super simple and super intuitive: what you want to install and what you want to run, and this is enough for people to run it either locally or in another environment. If you hire someone next week or one of your employees is sick, the next person doesn't need to start reading documentation to recreate the environment. They can just run one command and they already have an experiment running, and they start getting real empirical impressions of how the experiment is going.
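As a rough illustration of that idea (base image, what to install, what to run), here is a minimal sketch. It is not Polyaxon's actual packaging format; `ExperimentSpec` and its fields are hypothetical, and the sketch simply renders the three pieces of information into a Dockerfile.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExperimentSpec:
    """Hypothetical packaging format: a base image, install steps, and a run command."""
    image: str
    build_steps: list = field(default_factory=list)
    run_cmd: str = ""

    def to_dockerfile(self) -> str:
        """Render the spec as Dockerfile text so the environment can be rebuilt anywhere."""
        lines = [f"FROM {self.image}", "WORKDIR /code", "COPY . /code"]
        lines += [f"RUN {step}" for step in self.build_steps]
        # Exec-form CMD must be a JSON array, so let json.dumps handle the quoting.
        lines.append("CMD " + json.dumps(["sh", "-c", self.run_cmd]))
        return "\n".join(lines)


if __name__ == "__main__":
    spec = ExperimentSpec(
        image="python:3.11-slim",                 # example base image
        build_steps=["pip install scikit-learn"],
        run_cmd="python train.py --lr 0.01",      # train.py is a placeholder script
    )
    print(spec.to_dockerfile())
```

The point of keeping the spec this small is that a new team member only has to read three fields to know how an experiment is built and started.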

The specification should also allow more than just simple use cases; it should allow more advanced use cases, for example, using local files or conda environments. Once you start doing experimentation locally or even on the cluster, you might start thinking about how you can scale this experimentation process. The first way is to hire more people. When hiring more people, they probably don't have the same gear, or you want to centralize the whole experimentation process in one place, in a cluster, and now you need to start thinking about scheduling and orchestration - for example, using Kubernetes to take advantage of all the orchestration that it has, and then building on top of that a scheduler that can schedule onto different types of nodes depending on who can access those nodes. By doing that, you give multiple users access to the cluster, and they can start collaborating on this experimentation process.

You might also just give a couple of users more power, giving them the possibility to start distributed training using multiple machines. You don't ask them to become DevOps engineers; they don't need to create the deployment process manually. They don't need to create a topology of machines manually to start training their experiments. The platform needs to know, for TensorFlow or PyTorch, these types of environments and these types of machines, and how they need to communicate. You also need to think about how you can do hyperparameter tuning, so that you can run hundreds or thousands of experiments in parallel.

Once you get to the point where you're running thousands of experiments, you need to start thinking about how you can track them so that you can create a knowledge center, and the platform should take care of collecting all the metrics, all the parameters, all the logs, artifacts, anything that these experiments are generating. User experience here is very important. From the data access and data processing to the experimentation, there are different types of users, and different types of users expect different levels of complexity or simplicity in what they're using, in terms of APIs and SDKs.

At Polyaxon, this is the tracking API. It provides a very simple interface for tracking pretty much everything that a data scientist needs in order to report results to the central platform. Once you have all this information, you can start deriving insights, creating reports, and distributing knowledge among your team; basically, having a knowledge center. Now you have different types of employees who are also accessing the platform to know how your progress is going. If you don't have this platform, managers will ask in an ad hoc way who's doing what and what the current state of progress is. But if you have tracking and a knowledge center where you can generate reports, people from different departments can access the platform and get a quick idea of how the whole team is progressing. You need some insights, you need comparison between the experiments, you need visualizations, and you need to also provide a very simple way of adding other plug-ins, for example, TensorBoard or notebooks.
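To make the tracking idea concrete, here is a minimal sketch. It does not reproduce Polyaxon's tracking API; the `Tracker` class, `best_experiment`, and the JSON-lines file used as the "central platform" are all assumptions made for illustration.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


class Tracker:
    """Hypothetical tracking client that appends records to a shared JSON-lines file."""

    def __init__(self, store: Path, experiment_id: str = None):
        self.store = store
        self.experiment_id = experiment_id or uuid.uuid4().hex

    def _write(self, kind: str, payload: dict) -> None:
        record = {"experiment": self.experiment_id, "kind": kind,
                  "time": time.time(), **payload}
        with self.store.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def log_params(self, **params) -> None:
        self._write("params", {"params": params})

    def log_metrics(self, step: int, **metrics) -> None:
        self._write("metrics", {"step": step, "metrics": metrics})


def best_experiment(store: Path, metric: str, higher_is_better: bool = True):
    """Scan all records and return (experiment_id, value) for the best value of one metric."""
    best = None
    with store.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["kind"] != "metrics" or metric not in record["metrics"]:
                continue
            value = record["metrics"][metric]
            better = best is None or (
                value > best[1] if higher_is_better else value < best[1])
            if better:
                best = (record["experiment"], value)
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "runs.jsonl"
        for name, lr, acc in [("exp-a", 0.1, 0.85), ("exp-b", 0.01, 0.91)]:
            run = Tracker(store, name)
            run.log_params(lr=lr)
            run.log_metrics(step=1, accuracy=acc)
        print(best_experiment(store, "accuracy"))
```

Once every experiment reports through the same small interface, comparing runs and finding the top performers becomes a query over the store rather than a question asked around the office.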

At this point we already have a lot of experiments. We know how to get to the top-performing experiments, and we need to start thinking about how we can deploy them. Deployment is a very broad word because it could be for internal use, for some batch operation, it could be a deployment on a Lambda function, or it could be an API or gRPC server, and you need to think about all the kinds of deployments that you need to provide inside the company. Lineage and provenance of the model are very important. When you deploy, you need to know how you got to this model, how you can easily track who created this model using what, and whether you should do some operation on top.

The packaging is also different from the packaging for experimentation, because the packaging for the model should know, "Given these experiments, I have some artifacts and it was using this framework. How can we package it as a container, and deploy it to the right destination?" Say you already have a couple of deployments, you're seeing performance improving, and managers are happy. Then a regulation changes and you need to act as fast as possible, so you need the lineage and provenance of this model. You need an auditable, rigorous workflow to know exactly how the model was created and how you can reproduce it from scratch. User experience is very important here as well, because if you have ad hoc teams working on different components, you need to provide them with different types of interfaces to derive as many insights as possible.

We get to the state where we went through the experimentation, we created a lot of experiments, we generated reports, and we allowed a lot of users to access the platform. We are thinking now about how we can do refinements. With refinements, you should think about how you can automate as much as possible the jump from one aspect to another, because if you don't have an easy way to automate it, you will involve the same people again, going from a data analyst or a data engineer, to a machine learning practitioner or data scientist, to QA, and then DevOps, and everyone needs to do the same work again and again. You need to think about how you can cache all the steps so that these people only intervene if they need to intervene. Again, the user experience is important, so if you are building an event-action or pipeline engine, you need to think about what your users are doing right now. If they are using Jenkins or Airflow, you should not just push a new platform and ask them to change everything. You need to think about how you can incorporate and integrate the tooling already used inside the company and just augment their usage.

Finally, you need to think about what kinds of events you are expecting so that you can trigger these pipelines. These events could come from different types of sources. If you are doing CI/CD for software engineering, you need to think also about CI/CD for machine learning. It's quite different because here you don't only have triggers based on code, where new code triggers some process or pipeline. You might also trigger a workflow because you are polling for data and there is new data. You might also trigger the workflow for other reasons. For example, you have a model or a couple of models already deployed, and some metrics start dropping. You need to know exactly what happens when a metric starts dropping. Are you going to run this pipeline or the other pipeline? Are you running multiple pipelines? What happens when the pipeline starts? Do you need some employees to intervene at that point with some manual work, or is it an automatic pipeline that just starts training by itself, looks at the data, and deploys?
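A minimal sketch of such event-to-pipeline routing is shown below. It is an assumption-laden illustration, not a real framework: `Event`, `EventRouter`, the event kinds, and the 0.8 accuracy threshold are all hypothetical, and the "pipelines" are plain functions.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str                                  # e.g. "new_code", "new_data", "metric_drop"
    payload: dict = field(default_factory=dict)


class EventRouter:
    """Maps event kinds to pipelines, each guarded by an optional condition."""

    def __init__(self):
        self._handlers = {}

    def on(self, kind, pipeline, condition=lambda event: True):
        """Register `pipeline` to run for events of `kind` when `condition` holds."""
        self._handlers.setdefault(kind, []).append((condition, pipeline))

    def dispatch(self, event):
        """Run every matching pipeline and return their results in registration order."""
        return [pipeline(event)
                for condition, pipeline in self._handlers.get(event.kind, [])
                if condition(event)]


if __name__ == "__main__":
    router = EventRouter()
    # Retrain only when accuracy falls below a (hypothetical) threshold.
    router.on("metric_drop",
              lambda e: f"retrain {e.payload['model']}",
              condition=lambda e: e.payload["accuracy"] < 0.8)
    # New data always refreshes the cached features.
    router.on("new_data", lambda e: f"rebuild features from {e.payload['bucket']}")

    print(router.dispatch(Event("metric_drop", {"model": "ctr", "accuracy": 0.75})))
    print(router.dispatch(Event("new_data", {"bucket": "raw-events"})))
```

Making the condition explicit answers the question of which pipeline runs when a metric drops; a step that needs human approval could be modeled as a pipeline that only files a request instead of retraining.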

At Polyaxon, there is a tool called Polyflow; it's open source and was supposed to be released last week. It's an event-action framework which basically adds this eventing mechanism to the main platform so that you can listen to Kafka streams, listen to new artifacts getting generated on some buckets, or listen to hooks coming from GitHub to start experiments. This is where user experience is very important. I think having a complete pipeline is not that important, but having just a couple of steps done correctly with user experience in mind is very important. When you are building something like these pipelining engines or this kind of framework, you need to think about what the main objective is, and I believe that it is to have as much impact on your business as possible. To have this kind of impact, you need to make your employees very productive.

That's it for me for today. I hope that you at least have some ideas if you are trying to build something in-house in your company, or if you are trying to start incorporating all these deep learning and machine learning advances and technologies. I cannot emphasize enough that user experience is the most important thing; whether you are a large company or not, and whether you have different teams working on different aspects of this life cycle, you should always keep the large picture in mind and not just create APIs that communicate in a very weird or very complex way. You need to think about who is going to access the platform. By that, I mean that the data analysts, data engineers, machine learning practitioners, data scientists, DevOps, and the engineers who are building the APIs, every one of these users should have the right way of accessing the platform, the right way of seeing how the progress is going, and the right way of adding value to the whole process.

All these workflows are based on my own experience developing Polyaxon. It's an open source platform that you can use for doing a lot of the things that I just talked about. The future of machine learning will be based on all these kinds of open source initiatives, and I hope that in the future, as a community, we can develop some common specifications or common standards so that users can always integrate new tools and can jump from one platform to another without feeling locked into some system that will just have a negative impact on their productivity.

Questions and Answers

Participant 1: How does your tool connect to well-known frameworks, like TensorFlow or Keras? How does the tool integrate with these kinds of deep learning frameworks?

Mourafiq: You are talking about Polyaxon, I assume. In Polyaxon, there are different kinds of integrations. There's an abstraction layer, and each framework has its own logic behind it, but the end user does not see this complexity. They just say, "This is my code. This is the kind of data that I want access to. This is how I want to run the experiments. If it's distributed learning, I need five workers and two parameter servers," and the platform knows that this is for TensorFlow, not MXNet, so it creates the whole topology, knows how to track everything, and then communicates the results back to the user without them thinking about all these DevOps operations.
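For TensorFlow, the topology a platform has to generate for a request like "five workers and two parameter servers" is usually expressed through the `TF_CONFIG` environment variable, a JSON document with a `cluster` section and a `task` section. The sketch below only builds that JSON; the host names, port, and the `tf_config_for` helper are hypothetical, and how Polyaxon produces this internally is not shown in the talk.

```python
import json


def tf_config_for(task_type: str, task_index: int, num_workers: int, num_ps: int,
                  job_name: str = "experiment", port: int = 2222) -> str:
    """Build the TF_CONFIG value for one replica of a worker/parameter-server cluster."""
    # Host names are placeholders; a scheduler would substitute real service addresses.
    cluster = {
        "worker": [f"{job_name}-worker-{i}:{port}" for i in range(num_workers)],
        "ps": [f"{job_name}-ps-{i}:{port}" for i in range(num_ps)],
    }
    if task_type not in cluster or not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no replica {task_type}/{task_index} in this cluster")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}})


if __name__ == "__main__":
    # Each replica gets the same cluster description but its own task entry.
    print(tf_config_for("worker", 0, num_workers=5, num_ps=2))
    print(tf_config_for("ps", 1, num_workers=5, num_ps=2))
```

The user only states the replica counts; generating one such value per machine and injecting it into each container's environment is exactly the kind of DevOps work the platform takes over.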

Participant 2: What kind of hyperparameter optimizations does Polyaxon support?

Mourafiq: At the moment, there are four types of algorithms built into the platform: grid search, random search, Hyperband, and Bayesian optimization. The interface is the same as the packaging format I showed; the packaging format is extended so that you can expose more complexity for hyperparameter tuning. Once you submit this packaging format, the platform knows that it needs to create a thousand or two thousand experiments. You can also specify the concurrency that the platform should use for this group, for example, running 100 experiments at the same time. You can also stop the whole process early if some of the experiments reach, for example, a given metric level, so you don't need to keep running all these experiments and consuming the resources of your cluster.
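The two simplest strategies mentioned here, grid search and random search, together with a concurrency limit and an early-stopping threshold, can be sketched as follows. This is an illustrative toy and not Polyaxon's implementation: `grid_search`, `random_search`, `run_group`, and the `stop_at` parameter are hypothetical names, experiments run as threads in one process, and Hyperband and Bayesian optimization are not covered.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor


def grid_search(space):
    """Yield every combination of the listed values (keys in sorted order)."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def random_search(space, n_trials, seed=0):
    """Yield n_trials configurations, each sampled independently from the lists."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(space[k]) for k in sorted(space)}


def run_group(configs, train, concurrency=4, stop_at=None):
    """Run experiments in batches of `concurrency`; stop once a score reaches `stop_at`."""
    results = []
    configs = iter(configs)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while True:
            batch = list(itertools.islice(configs, concurrency))
            if not batch:
                break
            scores = list(pool.map(train, batch))
            results.extend(zip(batch, scores))
            if stop_at is not None and max(scores) >= stop_at:
                break  # good enough: leave the remaining configurations unscheduled
    return results


if __name__ == "__main__":
    space = {"lr": [0.1, 0.01, 0.001], "layers": [1, 2, 3]}

    def train(config):
        # Stand-in for a real training run; returns a made-up score.
        return config["layers"] * 10 + (1 if config["lr"] == 0.01 else 0)

    results = run_group(grid_search(space), train, concurrency=3)
    print(max(results, key=lambda item: item[1]))
```

Stopping between batches keeps the sketch simple; a real scheduler would also cancel experiments that are already running once the target metric is reached.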

Participant 3: How do you keep versions of the data? If you wanted to repeat some of these experiments later on, maybe you would not have the original data anymore, or the original data source. How do you version the data?

Mourafiq: What I showed is a very simple, minimalistic version, but you can also say what type of data you want to access, and the platform knows how to provide the right credentials to access the data. For tracking the versions, you can log a reference to the data. Then, if you are splitting the data or doing some more things during the experiments, you can log hashes that represent the data.
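One common way to record which data an experiment saw is to log a content hash of the dataset alongside the run. The sketch below assumes the data lives in a local directory; `dataset_fingerprint` is a hypothetical helper and is not part of Polyaxon.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Hash relative file names and contents, in sorted order, into one SHA-256 digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the name and the content
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "train.csv").write_text("x,y\n1,2\n", encoding="utf-8")
        before = dataset_fingerprint(root)
        (root / "train.csv").write_text("x,y\n1,3\n", encoding="utf-8")
        after = dataset_fingerprint(root)
        print(before != after)  # the fingerprint changes when the data changes
```

A fingerprint like this does not preserve the data itself, so it answers "was this the same data?" rather than "can I get the data back?"; keeping the original data available is a separate storage decision.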


Recorded at:

Aug 30, 2019