How to Operationalize Transformer Models on the Edge



Cassie Breviu discusses different model deployment architectures, how to deploy with edge devices and inference in different programming languages.


Cassie Breviu is a Senior Program Manager at Microsoft on the AI Frameworks team. She is a career switcher and self-taught developer. She is passionate about D&I and helping others learn to code. She enjoys working in technologies with two letter abbreviations: AI/ML, MR/VR/XR/AR. But really, she loves all things tech and enjoys building innovative solutions and trying out emerging technologies.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Breviu: I'm Cassie Breviu. I am a Senior Program Manager at Microsoft. I'm super excited to be talking to you about transformer models, and how we can operationalize them, and operationalize models in general, that will apply to things that aren't just transformers. Transformers really are the cutting edge natural language processing models that are out there right now. I started out as a full stack C# engineer, and then I moved into AI, because I love it. Over the years, I've just transitioned and so now I'm focusing more on AI. This particular project is something that brings everything together. I like it when you get to work on multiple domains. This project really does that. This is really meant to give you that path forward, give you the ideas, and helping you start to think how you can incorporate MLOps into your current system, and how you can think about deploying different models, and also help grow awareness around different pre-built models that are out there.


We're going to look at transformer models. We're going to look at open source models. We are going to be looking at different ways that you can deploy those models and the different considerations, pros and cons, for each architectural decision. We'll get into tooling. Then, of course, we'll do lots of demos to show you how to use these things.


This is what our end project will look like. I just wanted to show you so you can get an idea. What's happening here is we actually have a transformer model that has been optimized and deployed to a browser, and it's actually been inferencing on the client in the browser using the hardware of the desktop. If you take a look, it actually gets really good accuracy as well. Let's learn a little bit about how we got there. For transformer models, these are the state of the art models. You've probably heard of BERT, or GPT-2, or GPT-3. These are transformer models. They're actually starting to do some really cool things even with computer vision for transformer models. It really started in the Natural Language Processing space, and still is, but now they're starting to see how they can use the transformer style models in computer vision.

Transformer Challenges

One of the biggest issues with transformer models is their size: they are absolutely huge. You need big, expensive hardware to run them. They're expensive to train. Then also there's a lot of latency because the inferencing takes longer because of how large they are. There's a lot of different techniques and things that have been happening in order to try to optimize, reduce model size, so we can make these really amazing models work more efficiently, which ultimately makes them more useful.

Model Hub

The model that we're going to be using is an optimized BERT model. We're going to be building that real-time, on-device sentiment analysis inference which you just saw. The model that we're going to be using is one that Microsoft actually applied the distillation technique to, to really reduce the size of the BERT model. Then we also made it task agnostic, so that allows us to go in and fine-tune it with an open dataset that is actually on the Hugging Face hub. Both this model, which we'll take a look at, and the emotions dataset that we're going to use to fine-tune it are posted on the Hugging Face hub.


Let's take a look at the Hugging Face hub and take a look at this model and the dataset that we're going to be using. Then we're also going to fine-tune and train this model. Here's the model that we're going to be using, the XtremeDistil BERT model from Hugging Face. Here is the emotions dataset that we're going to be using also available on the Hugging Face hub. If you take a look, you can actually peruse through the data here right in the UI, which is pretty nice. Then we'll actually be able to download the model and the dataset and train all through the API for transformers. Now that you saw on the Hugging Face hub, where the model lives and the dataset, now let's take a look at the Notebook that shows us how we fine-tune this model. Here is the model name that we want to get. Here's the dataset, again, these are the Hugging Face APIs that have everything that we need, so it makes it really super easy to download the pre-trained model and the tokenizer. Here we are able to download the dataset and then we have our tokenizer from the transformer and from the pre-trained model. Then, down here, we are just mapping that dataset to the tokenizer with a function. We're splitting it out into test and train, and take a look at the length here. We're setting it to our CUDA GPU device in the model and then sending it to the device for training. It's on the right device for training.

For our evaluation, we want to take a look at what the accuracy is, and print that out. We're computing the accuracy based on the actual labels versus the predictions, and getting back how well it's doing. Then we're setting up our training arguments and creating our trainer, all with the transformers API. We're giving it our model, our training args, our datasets, and our compute metrics. If we scroll down, you can see we're doing 24 epochs, and then we can see how many optimization steps there are. As we go and train, we should be seeing the loss start getting lower. Scroll way down, by epoch 13, you can see that our loss is down to 30, evaluation accuracy is up to 90. Then by the end, you can see what our loss and eval accuracy were up to. That's our last epoch. I think that's a good enough model. We can run an evaluation as well, and you get some final stats here, just like we saw at the end of training. We're going to stop there now that our model is trained, and then let's go back and look at all of the ways that we could think about deploying this model, and how we might be able to optimize it a little bit more.
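As a rough illustration of the accuracy computation described above, here is a minimal sketch in plain Python. It assumes the evaluation step hands back per-class scores (logits) and integer labels. The function name `compute_accuracy` and the sample values are hypothetical and are not taken from the notebook in the talk.

```python
from typing import Sequence


def compute_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> dict:
    """Return the share of examples whose highest-scoring class matches the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one example")
    # The predicted class is the index of the largest score in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up scores for four examples and three classes (illustration only).
example_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.0], [2.2, 0.4, 0.1]]
example_labels = [1, 0, 2, 1]
metrics = compute_accuracy(example_logits, example_labels)
print(metrics)
```

A function of this shape is the kind of thing a trainer can call after each evaluation pass, which is how a per-epoch accuracy figure like the one above ends up in the training log.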


We have trained our model, now what? How are we going to deploy this? Let's look at some of the considerations that you need to think about once you're ready to move a model to production. Many times in your org, or in your company, you're going to have the data scientists creating the model, and they're going to pass this off to the ML engineer, who's then going to go through the operationalize process. There'll be multiple roles that probably exist. I think it's interesting to really think about it from the whole perspective. There's a lot of things to start considering when you're ready to move your model to production, and how that might fit into your current CI/CD. You need to think about hardware. The large transformer models need big GPUs, like A100s, and they take a lot of time and a lot of money to run inferencing. We looked at a new model now that's been distilled, and that we're going to learn how to optimize a little bit more, that's going to reduce the needs of the hardware. Depending on what you're building, or what you're doing, you may need bigger hardware. You're going to need to think about how big of a GPU do I need? How much RAM do I need? How much data am I going to have to be able to have in memory while I'm running my inferencing?

The hardware is a big part of moving your model to production and understanding the needs. That comes down to the size of the model and the speed needed, how long you're willing to wait for an inference. You can have a large GPU that's going to give you a lot faster results. For a model that might be able to run on a smaller GPU, how long are you willing to wait for that inference? We'll talk a little bit how you can get around some of those latency things and the different ways that you deploy your model. Scaling and continuous training, testing and development is another big one. You're going to have data drift, and eventually, you're going to have to retrain your model, and you need to make sure that it's still performing well. How do you get that model feedback? Then, also, make decisions on retraining and pushing new versions of the model. Those are some high level considerations and questions you ask yourself once you get to the point of, I have a model that I want to give to the world.

Machine Learning Workflow

When you start thinking about this whole process, from beginning to end, you can see, we had the prepared data, which was already prepared for us, which is great. Then we went into our Jupyter Notebook in VS Code to train our model. We were just using local compute. You also might need to offload that workload to cloud compute, which you can do with different tools like Azure Machine Learning that allows you to actually run it locally, or run your IDE locally, but offload the training and stuff to cloud compute, to speed things up, or maybe just because you don't have the specs that you need locally. Then we train. Then you might have some model versioning, and then we're going to build the image and deploy. When you start thinking about that overall process, we're really to that point where now we're ready to register, build, and deploy that model.

Deploy in Application/On Device

The first way that I'm going to talk about deploying the model is right in the application and on the device. Whether that is, you're doing inferencing in JavaScript, and inferencing on the client, like in a browser, or if you're deploying to a mobile application, and you want to deploy your model directly with the application and do all the inferencing on the device. Another example might be an application like a website that you're deploying, like maybe for C#, it's like a web app and you want to actually do your inferencing in C#, and you want to do that directly in the application, this would be one way to do it. There's some really nice pros to this, but then there are some limitations as well. It's really simple, because you have it all in this really neat package, and you can deploy it all as one. That's pretty cool. It works offline, because you don't need to call out to like a web API, or to call it where your model is hosted to get inferencing or get results. Then there's cost savings because you're not going to pay for that inferencing to happen. Then your model might be too large. If your model is too large to work on the device, then you're not going to be able to actually deploy it on device, or you'll have to look at different optimizations and ways to reduce the size and hardware needs of your model. The other thing is that when you want to update that model, you'll have to redeploy the whole application. Because it's part of the whole package, so in order to do new deployments and update that model, you're going to have to deploy the whole thing.

Client Inference Value Prop

Really, the key benefit of deploying on device is lower latency: because you're running the model on device, there's no callout. It works offline, again. There's also an interesting consideration around privacy. Because you're doing inferencing on device, your data actually never leaves your device, therefore, you're never sending that over the network. From a privacy perspective, if that's something that's really important to you, doing inferencing on device is a good solution for that. Then also the cost savings. Because you don't actually have to pay for a server to be up and doing that inferencing, it's all offloaded to the device. Those are some really cool value propositions and things that make client inferencing enticing.

Deploy as a Web API

The next way that you could think about deploying your model is deploying it as a web API. This is a really cool way because it gives you a lot of flexibility, because you're just going to set up a web service. If you already have some microservices architecture, this is just another service that's going to allow you to make a callout to get a result. You have that flexibility. You can use whatever hardware you need. You might have to pay more, but you're going to be able to scale your hardware the way that you need in order to get the inferencing results and latency and throughput that you want. It also decouples things: it removes the model from the application, so you can do model deployments and versioning without ever deploying a new application, assuming that the outputs and inputs stay the same.

This one can also get more complex. As microservice architectures have become so popular, they are also still very complex with orchestration. We'll talk about some tooling that you can use to help that. Even if you're not doing microservices, maybe you're just doing a serverless like an Azure Function or something like that, you can still use this type of architecture. This tends to be my go-to just because it's really nice to be able to choose your language and your hardware and all of that, because most of machine learning is built in Python, or R. If you want to be able to put up like a FastAPI or a Flask application in Python, and do all of your data pre and post-processing in Python, right in the application, that is going to be a benefit. There's times where that's not going to work for you, and we'll talk about some solutions for that.
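To make the web API option a bit more concrete, here is a minimal sketch of a prediction endpoint. The talk mentions FastAPI or Flask; this sketch swaps in Python's standard-library `http.server` purely so that it has no dependencies. The `/predict` route, the `predict` placeholder, and the JSON shape are all assumptions for illustration, not the speaker's service.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text: str) -> dict:
    """Placeholder for real model inference (hypothetical; returns a fixed label)."""
    return {"label": "neutral", "characters": len(text)}


def handle_predict(body: bytes) -> tuple:
    """Turn a raw JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        text = json.loads(body)["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a string 'text' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps(predict(text)).encode()


class PredictHandler(BaseHTTPRequestHandler):
    """Routes POST /predict to handle_predict and writes the JSON response."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8000) -> None:
    """Start the service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the service. Because the application only depends on the request and response shape, the model behind `predict` can be swapped or re-versioned without redeploying the application, which is the decoupling benefit described above.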

Preprocess Data Predictions

Another interesting way that you can deploy your model is by doing something like an overnight job or a batch process that statically stores your data predictions, so that your app just goes and gets those results from the database. This can be really useful. I had one model where I used this flow because I didn't need the inferencing in real time throughout the day. It was something where I could go get the data that had been added over the day and run inferencing on it. It was a classification on medical components. I could go get that information, run my classification, and save that to a database, and then my app could just go grab and see what it had done. Or you can send a nightly report. Again, it depends on the problem that you're solving and what's important to you. If you don't need real-time inferencing, this could be a really good solution. It gives you the flexibility of the hardware choice, language choice. It decouples from the application again. On the con side, there's the latency consideration, because you're not going to be getting real-time inferencing results, so you really have to think about the solution that works best for what you're trying to do.
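The batch flow above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions: the table names, the `classify` placeholder, and the sample records are hypothetical, and a real job would call the trained model and a production database instead.

```python
import sqlite3


def classify(text: str) -> str:
    """Placeholder classifier; a real job would call the trained model here."""
    return "long" if len(text) > 20 else "short"


def run_nightly_batch(conn: sqlite3.Connection) -> int:
    """Score every record that has no stored prediction yet; return how many were scored."""
    rows = conn.execute(
        "SELECT r.id, r.text FROM records r "
        "LEFT JOIN predictions p ON p.record_id = r.id "
        "WHERE p.record_id IS NULL"
    ).fetchall()
    conn.executemany(
        "INSERT INTO predictions (record_id, label) VALUES (?, ?)",
        [(record_id, classify(text)) for record_id, text in rows],
    )
    conn.commit()
    return len(rows)


def get_prediction(conn: sqlite3.Connection, record_id: int):
    """What the application does at request time: a plain lookup, no inference."""
    row = conn.execute(
        "SELECT label FROM predictions WHERE record_id = ?", (record_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
conn.execute("CREATE TABLE predictions (record_id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO records (text) VALUES (?)",
    [("short note",), ("a considerably longer record",)],
)
scored = run_nightly_batch(conn)
print(scored, get_prediction(conn, 1), get_prediction(conn, 2))
```

Running the job a second time scores nothing new, so it is safe to schedule repeatedly. The trade-off is the one noted above: a record added after the job runs has no prediction until the next run.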

Other Callouts

I mentioned some other tools around deployment. I just want to call some of these out because they're popular solutions. If you do Kubernetes, you should look into Kubeflow. It's a toolkit that allows you to do machine learning within Kubernetes architectures. There's Apache Kafka. This is for very large scale, high performance data pipelines and streaming. That one might be what you need. Then something that's becoming more popular is pre-built and configured Docker containers for inferencing specifically. I know pre-built containers are not new and exciting, they're very much standard use now. In the inferencing space, people are coming out with containers that are just pre-configured with all of the different tools that you need to optimize your inferencing. One in particular is the Triton Inference Server from NVIDIA. There's also a lot of companies that like to do their own. They know what they want and then they make that image available. That's their go-to for their machine learning because they know their process, and they know what they want. Looking at different pre-built Docker containers with the tooling and things that you need for inferencing is another thing to think about when you get past the point of, I've chosen my architecture. Now, how am I going to actually take that next step?

Machine Learning on the Edge

We talked a little bit about the data science process where you have your data preparation, you have your model, and you're validating it, whether you're doing that training on the cloud or on-prem. I was just doing mine in a local Jupyter Notebook. Then you get to the point where you're deploying your model, which we just talked about. We thought about the different types of architectures we're going to use, and talked a little bit about some technologies that you could use, and tooling there. Then your model is out in the wild and you're going to be wanting to collect data and understand how it's performing. Eventually, you might notice that it's not performing the way that you still need it to. One way to think about collecting that data is creating different kinds of feedback systems, whether you're logging results and have some automated way to see how that's doing, or you have a person that goes in and checks every once in a while. For the overnight job model that I mentioned, we actually had a clinical informaticist that would go in and either say, yes, this is good, and use it for training, or, no, this is not good, and don't use it for training. There was this human feedback that we would use after the model had gone and done all these inferences for us and brought it back. Again, it's really specific to how you are deciding to deploy your model, what thresholds you're looking for, and what performance you're looking for. You do need to think about monitoring and making sure that you're still continuing to get good results.
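One way to picture the human feedback loop described above is a small monitor that tracks recent reviewer verdicts and flags when the approval rate drops below a threshold. This is only a sketch: the class name, the window size, and the thresholds are hypothetical choices, not values from the project in the talk.

```python
from collections import deque


class FeedbackMonitor:
    """Track recent human verdicts and flag when the approval rate drops too low."""

    def __init__(self, window: int = 100, min_approval: float = 0.9, min_reviews: int = 10):
        # Only the most recent `window` verdicts are kept.
        self.verdicts = deque(maxlen=window)
        self.min_approval = min_approval
        self.min_reviews = min_reviews

    def record(self, approved: bool) -> None:
        """Store one reviewer verdict: True for good, False for not good."""
        self.verdicts.append(bool(approved))

    def approval_rate(self) -> float:
        """Share of recent predictions the reviewer accepted."""
        if not self.verdicts:
            return 1.0
        return sum(self.verdicts) / len(self.verdicts)

    def needs_retraining(self) -> bool:
        """Only raise the flag once there are enough reviews to trust the rate."""
        if len(self.verdicts) < self.min_reviews:
            return False
        return self.approval_rate() < self.min_approval


monitor = FeedbackMonitor(window=10, min_approval=0.8, min_reviews=5)
for verdict in [True, True, False, True, False, False]:
    monitor.record(verdict)
print(monitor.approval_rate(), monitor.needs_retraining())
```

The same verdicts can double as labels: approved predictions go into the next training set, and rejected ones are left out.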

Then once you notice that there's something off, whether that's through an alert, or through a human check every once in a while, or a unit test, or some threshold, whatever you've decided to implement for that, then you need to go retrain and redeploy. On the one that we're using, it's pretty standard, and it works pretty well as it is. I'm not really sure how much monitoring we'd really need. We're also doing inferencing on the client, so getting feedback about how it's doing would be difficult, because we'd have to actually send back that information. It doesn't mean we couldn't, but one of the things that we like is the offline and the data privacy of keeping the data on the device. In our scenario, we actually wouldn't really be collecting any of that data back. That doesn't mean that you wouldn't have a scenario where you would think that's important, and then you could figure that out. Or, maybe you have some logic within your application that checks on that and decides when, ok, we need to send this information back because it's important for us to consider this as an issue with the model that we might need to retrain. Then of course, you redeploy, publish your artifacts. This is your cycle of training, monitoring, deploying. Again, all of that monitoring and retraining process depends on the architecture that you've gone with.


We've talked about high level architecture. We've talked about the training process, the deploy process, different considerations to make once you have your model working, a little bit about tooling. Now we're going to go a little bit more into tooling and specifically around how we need to additionally optimize this model that we have trained to work on the web. It's still too large, and we need to make it a little bit smaller. We are going to be using something called ONNX Runtime and the ONNX model format. Essentially, when you save out your model, or you export your model after it's been trained, you get a file. For this export, we're going to export it to a .onnx file type, which is an open format that gives you a lot of things for free, like model portability, allows you to do inferencing in different languages, and also has different optimizations within it, as well as different optimizations that we can then add through the ONNX Runtime. There's two separate things here. We have ONNX, which is the format that we're going to export our model to. Then, we are going to leverage that ONNX model with ONNX Runtime.

Again, we're all about reducing those costs in inferencing. The more we can optimize and make things better, the less money it's going to cost us. Also, from an environmental standpoint, which we're going to learn a little bit about. Later on in this track we're going to learn a little bit more about the Green Software Foundation, and some things that you should be thinking about when you're deploying your model and using cloud compute. One way to do that is to optimize your model in a way where you're actually running for a shorter time. Not only is it going to save you costs, it's better for the environment. It's all good things.

The other thing about ONNX Runtime is it supports many frameworks. You can choose whatever training framework you want, then you export that to the ONNX format. Then from there, you can deploy to different devices, different hardware optimizations, and different execution providers, and get all of the portability that you're going to be looking for, and the flexibility that you need to solve your problem. One of the things that ONNX Runtime has built in is quantization. That means converting the weights to lower precision values, so you can get a faster, smaller model, but still maintain accuracy. Here are some metrics around different models where Int8 quantization was able to make the model four times smaller, while still maintaining pretty much the same accuracy. These are some of the accuracy and F1 scores for the quantized models, comparing both PyTorch quantization tools and ONNX Runtime.
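To show where the roughly four times size reduction comes from, here is a minimal sketch of symmetric int8 quantization in plain Python: each 32-bit float weight is replaced by an 8-bit integer plus one shared scale factor. This is a simplified illustration, not how ONNX Runtime implements quantization, and the function names and weight values are made up.

```python
from array import array


def quantize_int8(weights):
    """Map float weights to int8 values plus one shared scale (symmetric quantization)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round each weight to the nearest step and clamp to the int8 range used here.
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


# Six made-up weights stored as 32-bit floats.
weights = array("f", [0.82, -1.27, 0.05, 0.4, -0.66, 1.1])
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

float_bytes = len(weights) * weights.itemsize
int8_bytes = len(quantized) * quantized.itemsize
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(float_bytes, int8_bytes, max_error)
```

The storage drops by a factor of four, and each restored weight is off by at most half a quantization step. Whether that small error matters for accuracy depends on the model, which is why the quantized model is compared against the original after conversion.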


Let's now take a look at how we are going to deploy that transformer model that we have now trained and fine-tuned. We are going to take it now and we are going to export it to an ONNX format, then we're going to use ONNX Runtime to quantize it. Then we're going to take a look at how we're doing that inferencing in JavaScript in the browser. We're back in our Notebook now, where we left off after our training. Now we're going to take a look at exporting our PyTorch model into that ONNX format and running it with ONNX Runtime Web, which is the package that allows us to do inferencing in JavaScript. First thing, right in Transformers, we have the convert to ONNX built in. Here you can see that we are creating a pipeline with transformers for text classification, because we need both our model and our tokenizer in the pipeline. When you run a natural language processing model, the first thing that happens is we tokenize our text, so we turn those words into numbers. Then we process them in our model, and then we get back a result, and then we decode that into the answer.

Then we're calling that convert, and then convert PyTorch. We're sending in our pipeline. Then we're setting the opset. There's different operator sets (opsets) in ONNX that support different operators that you may have used when building your model. It does default, I think to 9. I'd have to double check the source to know exactly what it defaults to. I definitely recommend overriding it to the newest opset that supports the operators in your model. You can take a look at our GitHub if you want to see which opset. If you choose one that doesn't work, it will let you know. Then you can see that we're giving it that output, and use_external_format is set to false. Because I've already run this, you can take a look over here and we can see that our classifier output is right here. We need to install ONNX Runtime here, and then from ONNX Runtime, as you saw in the snippet on the slide, we're going to grab the quantization module. We're going to call quantize_dynamic. We're going to send in our current model, the model that we want out, and what we want to quantize it to. From there, then we get back the quantized int8 model that we just created. If you take a look at the size difference from the classifier, you can see that this drastically reduced the size of our model. That's exactly what we wanted. Now we have a much smaller model that we can actually deploy and use in the web.

Import this, we're using the newest version. Then when you actually go to use ONNX Runtime, you create this InferenceSession, and you pass in your model. We now have a session for both the unquantized model and the quantized model. Now that we've created our session, we can run the input_feeds. Then when we do, we're going to send in our feed, and we're going to get a result. We ran two different models here, so that we can check the prediction accuracy, and see what the difference is now that we've quantized our model. We can actually see that we lost a little bit of accuracy, but for the performance gain from the reduction of size in the model, and the accuracy that it still performs at, we're ok with this. That is something, again, based on your scenario that you'll need to make a decision on and probably do some testing and figure out if the reduction in accuracy still allows your model to predict well enough.

Now that we have our model, we're actually ready to look at how we can deploy that into our application. Let's jump into our JavaScript code. You see here that we are importing our bert_tokenizer, which is actually just one that we've grabbed from TensorFlow. Since it's a BERT model, we're able to just grab that one, we don't have to create it ourselves. Then we're going to add that into our JavaScript. The next thing we're going to add is the package, onnxruntime-web. This is what's going to allow us to do inferencing in JavaScript. There's packages in many languages like C#, and Java, and C++. There's packages in all types of languages that give you the flexibility to do inferencing in languages outside of the language that you use to train your model, which is pretty awesome. It's super useful in a lot of different ways. We are using Wasm. With ONNX Runtime Web, you can use WebAssembly, or you can use WebGL. Right now for operator support, Wasm has better coverage, so we want to use that. If you wanted to use WebGL, all you have to do is actually change this to webgl. The other thing is Wasm is only using CPU, so we aren't able to use the optimization that we might get from using the GPU. Again, for this model, it works just fine. Those are some considerations, again, when you're thinking about deploying.

We're getting our quantized model, which is right here. Then we're creating our InferenceSession. Just like we did in Python, where you saw that we created our session, and then we ran the session, it's the same thing in each language. Just like before, we're creating our InferenceSession, we're sending in our model, and our options here. Then we're just going to asynchronously load that. Then we're calling that loadTokenizer. We're setting up our label options. We have to encode the text in order to get that to run through our model. This is our JavaScript logic for encoding our text. Then we're going to take the session that we created with ONNX Runtime, and we're going to call that to run and send in our model input. We're getting the duration. In the demo, you probably saw that there was an inference time shown for each one, which is really useful when you're looking at, how long is this actually taking to get results? You could also do different things which aren't included here, like a debounce. Maybe you don't want to do an inference on every keystroke, you want to wait till they're done typing and then do fewer inferences. Since we're doing inferencing locally, there's really not much of a reason I can think of why you'd want to do that. I would think that'd be more if you're going to be making those API calls out.
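The debounce idea mentioned above can be sketched as follows. The demo itself is JavaScript; this sketch uses Python and passes timestamps in explicitly so that the behavior is easy to follow and deterministic. The `Debouncer` class, the 0.3 second wait, and the `run_inference` placeholder are all hypothetical.

```python
class Debouncer:
    """Trailing-edge debounce: only the last input before a quiet period is released."""

    def __init__(self, wait_seconds: float):
        self.wait_seconds = wait_seconds
        self._pending = None
        self._last_input_at = None

    def submit(self, text: str, now: float) -> None:
        """Record a keystroke; any earlier pending text is superseded."""
        self._pending = text
        self._last_input_at = now

    def poll(self, now: float):
        """Return the pending text once the user has been quiet long enough, else None."""
        if self._pending is None:
            return None
        if now - self._last_input_at < self.wait_seconds:
            return None
        text, self._pending = self._pending, None
        return text


def run_inference(text: str) -> str:
    """Placeholder for the real model call."""
    return f"scored: {text}"


debouncer = Debouncer(wait_seconds=0.3)
calls = []
# Simulated keystrokes arriving 0.1 s apart, followed by a pause.
for timestamp, typed in [(0.0, "I"), (0.1, "I l"), (0.2, "I li"), (0.3, "I like")]:
    debouncer.submit(typed, now=timestamp)
    ready = debouncer.poll(now=timestamp)
    if ready is not None:
        calls.append(run_inference(ready))
ready = debouncer.poll(now=0.7)
if ready is not None:
    calls.append(run_inference(ready))
print(calls)
```

Four keystrokes result in a single model call. In a real application the timestamps would come from a clock such as `time.monotonic()`, and the poll would be driven by a timer or the UI event loop.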

We're calling sigmoid on our results, and calling Math.floor. We're doing absolute values. Sometimes you get negative values, and so there's different post-processing that you need to do for each model. That will change based on the model that you're using. Also, sometimes the harder part when doing inferencing in other languages is that you have to figure out how to do that pre- and post-processing in the language that you're doing your inferencing in. Actually, with ONNX Runtime, we have some different templates that show you how to do the basic encoding and decoding for computer vision both with C# and JavaScript. Those are some templates that are available on our docs, to try to help with those pieces, but usually you're going to have to change it somehow for your model. Then we're just creating our list. This is a React app.
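As a rough picture of the post-processing step described above, the sketch below applies a sigmoid to each raw model output, floors the result to a whole-number percentage, and returns the highest-scoring labels. It is written in Python rather than the demo's JavaScript, the label names and logit values are invented, and the absolute-value step mentioned in the talk is left out because its exact role is not shown here.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a raw model output into the 0..1 range (numerically stable form)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def top_labels(logits, labels, top_k=3):
    """Convert raw outputs to whole-number percentages and return the top-scoring labels."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    scored = [(label, math.floor(sigmoid(value) * 100)) for label, value in zip(labels, logits)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical labels and raw outputs, for illustration only.
labels = ["love", "joy", "neutral", "anger"]
logits = [2.0, 0.0, -1.0, -3.0]
print(top_labels(logits, labels))
```

Because each label gets its own sigmoid score rather than sharing a softmax, several labels can score highly at once, which matches a demo that lists multiple emotions for one sentence.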

Since we are now ready to run, let's take a look, just call npm start, and this will start off our demo. We could just put down, "I like machine learning!" We can see that we have love, admiration, approval, joy, neutral, amusement. "I don't like the cold." We can see disapproval, annoyance, neutral, disappointment, anger. You can say, "I like the warm." Then you'll see actually that you'll get both disapproval and love as the top one. If you have a multi-sentence review, you can get average scores through that and see that it's going to pick up on the individual sentiments within multiple sentences in your result. There you go. That is how we were able to take a large transformer model, grab an open source distilled version of it, then quantize it, and use ONNX Runtime to do all of our inferencing on the edge.


We looked at different architectures. Then we learned how to use ONNX Runtime, and specifically the ONNX Runtime Web package. Be sure to check out, where you can find more information on how to use it, you can learn about the different packages we have for the different languages, and find samples as well as templates to get quick starts. We also have a YouTube channel. In our YouTube channel, we cover specific things of how to do X with ONNX Runtime, in short 5-minute bytes. Be sure to check that out and hang out with us there. We also have a LinkedIn and a Twitter to stay connected. Then here are the links to the model and datasets that we use. Then, also, if you want to check out the source code that we used, and check out a blog post on the demo that we did, those links are there. They were provided by our community member, Jo Bergum.

Questions and Answers

Polak: Will model on device degrade the device performance? Any way to limit the device resource usage?

Breviu: Yes. There's a few different things you can do. I talked about quantization. You reduce the size of the model, then it will take less resources to run it, and also you have to think about the memory size because that model is going to be downloaded onto the device. You can't do it with a huge model, you have to optimize your model in order to work on a device. If you can't do that, then you probably don't want to be doing inferencing on the device. One thing to really consider is just the overall size of the model, and what you're trying to do. There's some that I have that I've played with, even doing inferencing in gaming and things like that. I know that's really important. If you can optimize it, if you can use the GPU, so like in different packages, you can actually leverage the GPU, which is going to help it.

The other thing is you can reduce the number of times you actually call inferencing. For example, with JavaScript, you can do a debounce. You can wait until they've typed so many letters before you actually run an inference. You're not doing as much inferencing, so that toll isn't happening as often. Think about how often you inference. Think about the size of the model. Those are really the main two things, and then leveraging the hardware that's there, in order to make your inference faster. Sometimes it helps too to break models up into multiple models, so that you can make them smaller. Here, we just took one model, and we made the model smaller. We had a distilled BERT model, and then we used quantization to make it smaller.

Another way that you could do it is maybe you separate them out into separate tasks, each more of a binary classification of true or false. You can change the task around too, so you can get really creative. The answer is, it depends. Think about what you're trying to do, how can you get creative? How can you make it more efficient? How can you make the model smaller? How can you do less inferencing, while still getting a good result?
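The debounce idea from this answer might look like the sketch below. The helper `debounce` is a generic utility written for this illustration, and `makeDebouncedInference`, `MIN_CHARS`, `WAIT_MS`, and the `runInference` callback are hypothetical names and values; the actual model call is left to the caller.

```typescript
// Returns a wrapper that delays calling fn until waitMs have passed with no
// further calls. Only the arguments of the most recent call are used.
function debounce<Args extends unknown[]>(
  fn: (...args: Args) => void,
  waitMs: number
): (...args: Args) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Args) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, waitMs);
  };
}

const MIN_CHARS = 3; // assumed minimum input length before inferring
const WAIT_MS = 300; // assumed quiet period after the last keystroke

// Wraps a caller-supplied inference function so that it runs only after the
// user pauses typing and has entered at least MIN_CHARS characters.
function makeDebouncedInference(runInference: (text: string) => void) {
  return debounce((text: string) => {
    if (text.trim().length >= MIN_CHARS) runInference(text);
  }, WAIT_MS);
}
```

A text box's input handler would call the function returned by `makeDebouncedInference` on every keystroke, and the model would still run only once per pause in typing.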

Polak: It always reminds me of TinyML, because we can build models based on huge amounts of data, but at the end of the day, the model still can be tiny. It's always a case of, what do you really need? How is the model footprint on disk itself, or in memory?

Breviu: That's even with any model. Even if you're deploying to cloud hardware and you have A100s, or you have GPUs that can handle large inferencing, those are expensive. You're still going to want to probably mitigate cost. I think the more that you can optimize and reduce the size, while still getting your results, you're going to save money, you're going to save time, and that stuff. It's definitely one of those things in machine learning, where they're always fighting with each other. The bigger the model, the more accurate the model, the better the performance. You want the bigger model, but then you're fighting against the costs. That's one of the things that's great about doing on device inferencing is you're not paying for big hardware, in order to inference so you can actually save a lot of money. You could even do something hybrid, like maybe you're doing some inferencing on device, but then sometimes you pass it off to a cloud where they're more accurate. Maybe you have some threshold that you've set statically in your code, and if something doesn't get hit, you send that out to your bigger model. There's all kinds of ways you could think about putting things together to get the results that you need.
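The hybrid idea, where a small on-device model answers when it is confident and a larger cloud model handles the rest, could be sketched like this. Everything here is an assumption for illustration: the `Prediction` shape, the `Model` signature, the `classifyHybrid` name, and the threshold value. Neither model is a real API; both are passed in by the caller.

```typescript
interface Prediction {
  label: string;
  score: number; // confidence of the top label, between 0 and 1
}

// Hypothetical signature: either side could wrap a local runtime or an HTTP call.
type Model = (text: string) => Promise<Prediction>;

interface RoutedPrediction extends Prediction {
  source: "device" | "cloud";
}

const CONFIDENCE_THRESHOLD = 0.6; // assumed static threshold, tune per model

async function classifyHybrid(
  text: string,
  onDevice: Model,
  cloud: Model,
  threshold: number = CONFIDENCE_THRESHOLD
): Promise<RoutedPrediction> {
  // Try the small local model first: no network cost, no server cost.
  const local = await onDevice(text);
  if (local.score >= threshold) {
    return { label: local.label, score: local.score, source: "device" };
  }
  // Not confident enough, so pay for the bigger model.
  const remote = await cloud(text);
  return { label: remote.label, score: remote.score, source: "cloud" };
}
```

The design choice is that the cloud model is only called when the threshold is missed, so the cost of the larger hardware scales with the hard inputs rather than with all traffic.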

Polak: I noticed you also tapped into the environmental aspects of cost. Would you like to elaborate more about that a little bit?

Breviu: There are some really interesting things happening in software when it comes to the environment, and what our actual footprint is for the different software that we're creating. Machine learning takes a lot of energy, because when we're training large models, they're expensive, and they cost money, but they're also putting out carbon, because that takes energy. Thinking about that side of it is, if I can make my models more efficient, I'm actually taking less energy, and therefore I'm actually better for the environment. I would look up the Green Software Foundation. They're doing some really interesting things. At ODSC, they had a talk, where they were talking about how they have started figuring out that they can choose when they pull power. This is outside of ONNX Runtime, but it's still relevant to green software. They can choose when they're actually pulling power from the grid, to pull when there is more green power available. For example, if you're charging your phone overnight, you don't really necessarily care when it gets charged, as long as it gets charged. They're starting to put just a few lines of code in to say, ok, I'm going to charge at this time, because I know the grid is being powered by green energy.

There are all these interesting approaches, and with machine learning, one of them is thinking about how to reduce the energy that training those large models takes, or that inferencing takes, and what that means for the environment. It's always good when you can be more efficient. That's one of the things ONNX Runtime does. You utilize more of the GPU in less time, which actually turns into being better for the environment as well.

Polak: That's amazing that we can actually leverage machine learning to do that, but also make sure we're running machine learning efficiently, so it pays both ways.

Breviu: Thinking about the environment holistically in everything that you do, not just driving.

Polak: There was another question about ONNX support for PySpark ecosystem, the Apache Spark ecosystem? Is there anything on the roadmap or something that PySpark developers need to be aware of? Usually, it would be a Python library that they would use, something that people need to know?

Breviu: PySpark support where?

Polak: There are many libraries today to build machine learning models, and one of them is PySpark. It's in Python. I wonder if the ONNX model itself, there's an SDK in Python or something that we can work with and actually deploy the model using it.

Breviu: I think all of the main frameworks are supported. A lot of people get confused, they think ONNX and ONNX Runtime are the same thing, and they're actually two separate things. You can use the ONNX format. That's what you export your model to after you've trained it. Then there's the runtime. There are actually different runtimes for ONNX; ONNX Runtime isn't the only one. You're asking about converting the model to ONNX from PySpark?

Polak: Yes.

Breviu: I would assume it is. I've never used PySpark. I use PyTorch most of the time, scikit-learn, and sometimes TensorFlow. PyTorch is my main jam. You've used PySpark a lot, haven't you, Adi?

Polak: Yes, I did, but I never tried with ONNX. This is why I was fascinated to see, we can actually leverage another platform, and then use ONNX to deploy it. Like one platform to build a machine learning model and then another one to deploy it.

Breviu: It looks like it is. Pretty much every main machine learning framework for building models supports converting to ONNX. Also, there are a lot of tools that just have it built in, like Hugging Face has it built in, and PyTorch has it built in. It's not even a separate package to convert it to an ONNX model. It's torch.onnx.export, that's how you do it in PyTorch. It's really widely supported, very highly used. We're actually doing a community day for ONNX with the Linux Foundation coming up on June 24th, in the new Microsoft Silicon Valley office, so that'll be neat. It's virtual and in person, if you want to check out the cool things that are happening in the ONNX space, because ONNX really solves so many problems that data scientists and machine learning engineers have, particularly when it comes to deployment and all the cross-platform things where you need to be able to optimize and deploy in many different ways.

Polak: With the whole MLOps world, it feels like we've been focusing for so many years on actually building the models. Now, as an industry, we finally understand that we need to support a larger infrastructure, which is MLOps. I don't think anyone ever paid much attention to it.

Breviu: They didn't. It started out with everyone asking, how do I build these models? Everything was about, how does machine learning work and how do I build it? Then I feel like collectively, people started figuring that out. It was interesting at ODSC, so many things were focused on the data side, because people also realize, ok, I want to build these, but my data is a mess and I can't actually do anything until my data is figured out. I feel like the data tools right now are really coming more to the forefront. Then there's the MLOps side, because even once you have this model, there's getting it to production and getting it working well, and all these different things, as you saw. We just talked about all the different considerations that you need to make, and all of that is really a focus now as well. It's interesting to see how the focus of machine learning in general has shifted from, how do I build this model? What is this operator? What is a convolutional neural network? To, how do I actually get data? How do I actually label data? How do I clean it, and do all these things? Then create these large pipelines in order to train a model, and then now I have this model, and I need to be able to inference on a mobile device, but it's too large. The new challenges that are coming up right now are really, I think, around MLOps, and that data prep, and data cleaning, and that side. That's where the focus is now: the data side.

Polak: I remember the article from Google, where it was very clear. It's like, this is the tiny box where you train your model, and here are all the different boxes that people don't pay attention to, like feature engineering, cleaning the data, deployment, model drift, all these aspects that were easy to forget.

Breviu: I think there are so many things in machine learning that were an afterthought, because we were always just so excited about like, look what we can do. At least that's my perspective. I feel like, there was all these really exciting things happening with models that we were able to create. It is still changing, and it is still happening. There are still so many new things that are coming out that we're able to do. It's so cool. Then I feel like the realization of making it a mature product and making it something that is repeatable and stable, was like, in order to make this actually good, we have to have the front and the back figured out. We have to have the data figured out. We have to have the MLOps figured out. We have to bring all those together in order to create something good.

Polak: What is the most exciting feature coming out for ONNX?

Breviu: One of the things that's interesting is the pre- and post-processing side of machine learning. When it comes to building models, you're generally doing it in Python, or R, or Julia, but Python is the main one. There are so many great libraries in Python to do the preprocessing for your data. Preprocessing for your model is like, if it's an image, it's resizing it, it's getting out the RGB channels, and it's turning it into a tensor, and all these different things that you have to do. It's really easy in Python. It is not so easy in other languages. When you say, now I'm going to inference in C#, you have to be able to do that preprocessing. We're working on tools to help with that to make it easier.
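As a rough picture of what "getting out the RGB channels and turning it into a tensor" means outside Python, here is a minimal hand-written sketch. It assumes the image has already been resized, that pixels arrive as interleaved RGBA bytes (the layout a browser canvas produces), and that the model wants channel-first float data scaled to the 0 to 1 range. Real models often need a different layout or mean and standard deviation normalization, and the function name `rgbaToChwTensor` is hypothetical.

```typescript
// Converts interleaved RGBA bytes (R,G,B,A,R,G,B,A,...) into a channel-first
// Float32Array: all R values, then all G values, then all B values.
// Alpha is dropped and each value is scaled from 0..255 to 0..1.
function rgbaToChwTensor(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number
): Float32Array {
  const pixels = width * height;
  if (rgba.length !== pixels * 4) {
    throw new Error("rgba length does not match width * height * 4");
  }
  const out = new Float32Array(3 * pixels);
  for (let i = 0; i < pixels; i++) {
    out[i] = rgba[i * 4] / 255; // red plane
    out[pixels + i] = rgba[i * 4 + 1] / 255; // green plane
    out[2 * pixels + i] = rgba[i * 4 + 2] / 255; // blue plane
  }
  return out;
}
```

The resulting array would typically be wrapped in a tensor with shape [1, 3, height, width] before being passed to the runtime. Writing and checking this kind of code by hand in each target language is the friction the answer describes.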




Recorded at:

Nov 18, 2022