
Machine Learning on Mobile and Edge Devices with TensorFlow Lite


Summary

Daniel Situnayake talks about how developers can use TensorFlow Lite to build machine learning applications that run entirely on-device, and how running models on-device leads to lower latency, improved privacy, and robustness against connectivity issues. He discusses workflows, tools, and platforms that make on-device inference possible.

Bio

Daniel Situnayake is Developer Advocate for TensorFlow Lite at Google and co-author of TinyML.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Situnayake: Today, we're going to be talking about machine learning on mobile and edge devices, specifically with a TensorFlow Lite flavor. I'll talk a little bit about what that is in a moment.

My name is Daniel Situnayake. I work at Google. I'm a developer advocate for TensorFlow Lite, which means I'm an engineer who works on the TensorFlow Lite team, but helps the TensorFlow Lite team understand and integrate with our community. I do stuff like building examples and working on bugs that come in from our community. I'm also the co-author of a book, coming out in mid-December, called "TinyML." It's the first book about machine learning, specifically Deep Learning, on devices that are really small. These are the models Wes mentioned that are 15 or 20 KB, but can do speech recognition or gesture detection.

What is TensorFlow Lite?

TensorFlow Lite, which is what I work on at Google, is a production framework for deploying ML on all different devices. That's everything from mobile devices on down. I'll talk a little bit about some of those categories of devices as we get further along.

Goals

My goal today is, first, to inspire you by showing what is possible with machine learning on-device, at the edge of the network. I also want to make sure we all have the same level of understanding of what machine learning is, the things it can do, and how we do that stuff. Finally, I want to give some actionable next steps: if you're interested in this space, how do you get involved? Where can you learn more? How can you get started? I wanted to see right at the beginning: who has a background in machine learning? Who's heard of TensorFlow? Then, who has worked on edge devices? Maybe you're a mobile, web, or embedded developer.

I'm going to give an intro to ML. Presumably, you've all heard of ML to some degree, but I'll cover the basics of what it is. Then we'll talk a little bit about ML on-device and why that makes sense. Then I'll go into some specifics of TensorFlow. Some of the hairy stuff I'll skip over really quickly, because it might not be relevant if you haven't used TensorFlow a lot already. We can always talk at the end about that, too.

What is Machine Learning?

First of all, I want to talk about what machine learning is. The easiest way to do that is to talk about what is not machine learning. The standard way that we build programs obviously is not machine learning. If I'm going to build a program, generally, I am writing some rules that apply to some data. I might write a function here: we're doing a calculation based on some data, and that happens through rules that we express in code. When that function runs, we get some answers back. The computation happens in that one place where we're taking the data, running it through some rules, and getting some answers.

Similarly, a video game works in the same way. There's some stuff going on in a virtual environment, and there are some rules which apply whenever stuff happens. All these types of things that we're familiar with as engineers have, in the past, generally used this type of programming. We're coming up with rule sets that handle stuff that happens in an environment. Pretty much what is going on is we create some rules and we create some data, we feed them into this box, and out of the box we get answers. Machine learning screws this up a little bit. Instead of feeding in rules and data, we feed in answers and data. Then our box actually figures out how to return some rules, which we can then apply in the future.

Activity Recognition

Imagine we're using the classical style of building an application to determine what activity people are performing. In this case, we can look at the person's speed. Imagine someone is walking: if they're going less than 4 miles per hour, maybe we can say that they are walking. If they're going 4 miles per hour or above, their status can be running. Then maybe we can come up with a rule that says, "If this person is going even faster, faster than a human can run, they're probably cycling." Then, what do we do if they're doing something completely different? Our rules just don't work for this. They break down; the simple heuristic that we've chosen doesn't make sense anymore. This is how the traditional programming model works. Let's have a look at how this might work in a machine learning application.
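Before moving on, here is a minimal sketch of that rule-based approach in Python, just to make the contrast concrete. The 4 mph threshold comes from the example above; the 12 mph boundary between running and cycling is an assumption added purely for illustration.

```python
def classify_activity(speed_mph: float) -> str:
    """Hand-written heuristic: classify an activity from speed alone."""
    if speed_mph < 4:
        return "walking"
    elif speed_mph < 12:   # assumed upper bound for human running speed
        return "running"
    else:
        return "cycling"   # anything faster than a person can run


# Works for the cases we anticipated...
print(classify_activity(3.0))   # walking
print(classify_activity(8.0))   # running
print(classify_activity(15.0))  # cycling
# ...but an activity like golfing might average 2 mph and be
# misreported as "walking" - the heuristic has no way to express it.
```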

In this case, maybe we have a sensor that's attached to a smart device that the person is wearing. We're taking raw data from that sensor, and we're going to train a machine learning algorithm to understand it. In the case that the person is walking, we feed the data for that into our model, along with a label that says they are walking. We do the same for running, the same for biking, and the same for golfing. We have basically fed all of these things into this model that we're training. We've said, here is what the data looks like for walking; here's what the data looks like for running. Our model can learn to distinguish between these categories without us knowing the exact rules and the exact heuristics that indicate them. The model figures that out for us.
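As a rough sketch of what that training step could look like in TensorFlow's Keras API (this is not the exact model from the talk; the window size, labels, and random placeholder data are all assumptions for illustration):

```python
import numpy as np
import tensorflow as tf

# Hypothetical dataset: 128-sample windows of 3-axis accelerometer data,
# labelled 0=walking, 1=running, 2=biking, 3=golfing.
# Random arrays stand in for real sensor recordings.
x_train = np.random.randn(1000, 128, 3).astype("float32")
y_train = np.random.randint(0, 4, size=(1000,))

# A deliberately tiny classifier: the point is that the rules are learned
# from labelled examples rather than written by hand.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 3)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)
```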

Demo: Machine Learning in 2 Minutes

I want to give a quick demo of this in action. It's a live demo. I'm going to use a tool that we released recently at Google called Teachable Machine. You can try this yourself; it's totally free. Teachable Machine is basically a studio for training your own machine learning models very easily, for prototyping experiences super fast. I'm going to do an image-based project: a rock, paper, scissors recognition model. Each of the activities that I'm trying to classify (rock, paper, or scissors) is represented here. I'm going to make those. I've got rock, paper, and scissors. Then I'm going to capture some camera data of myself doing the rock, paper, and scissors signs. Does everybody know what rock, paper, scissors is? I'll do that via the webcam. I just need to capture a bunch of photos of myself doing this rock sign. I'm going to try that now. Here's my rock. I'm turning it all around so you can see it from a bunch of different angles. It understands not just one image of a hand, but generally that a hand rotated all around can still represent rock. I'm going to do the same for paper.

I don't really need that much data here. Let's do the same for scissors now. I've got less than 100 samples for some of them. The rock has a few more because I was talking while I was doing it. It doesn't really matter. What we're going to do now is train a model that uses these images and the labels to understand what is meant by the rock, paper, and scissors gestures. What happens during training is very complicated, and I'm not going to go into it now; there's a lot of literature and a lot of interesting stuff online that you can read about how this works. Essentially, what we're doing here is taking a model that was already trained on vision. It understands how to break apart a visual scene into lots of different shapes, colors, and objects. Then we take that pre-trained model and customize it a little bit so that it specifically understands what the rock, paper, and scissors gestures mean. Right now I'm not doing anything, so it's oscillating wildly between the three. Let's see if I can do rock. We got really high confidence there. If I do paper, that also works. Then scissors. Scissors is a little bit harder to discern from paper, but yes, there we go, it's working pretty well.
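Under the hood, what Teachable Machine does is broadly this kind of transfer learning. Here is a hedged sketch using a pre-trained MobileNetV2 base from Keras; the data directory layout, image size, and training settings are assumptions for illustration, not the tool's actual implementation.

```python
import tensorflow as tf

# Assumed layout: data/rock, data/paper, data/scissors, filled with webcam captures.
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32)
# Scale pixels into the [-1, 1] range that MobileNetV2 expects.
train_ds = train_ds.map(
    lambda x, y: (tf.keras.applications.mobilenet_v2.preprocess_input(x), y))

# Pre-trained base that already understands shapes, colors, and objects.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg",
    weights="imagenet")
base.trainable = False  # keep the general visual knowledge frozen

# Add a small head that learns just the three gestures.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # rock / paper / scissors
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```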

This is how machine learning works: you capture data, label it, feed it into a model during training, and the model gets adjusted so that it can do this stuff in the future. Then you get something pretty robust, potentially pretty quickly. The basic concepts behind this technology have been around a while, but being able to do it reliably, and so easily and so quickly, has only been figured out in the last five years or so. It's pretty exciting.

Key Terms

I want to cover some key terms just so that we're able to talk about this stuff fluently. The first thing that I'll define is a dataset. A dataset is the data that we're going to be feeding into the model during training. That includes, in the case I just showed, the photos of me doing the gesture, and also the label.

Training is the process of taking a model, which is basically a bunch of data structures that are woven together in a certain way, and gradually adjusting it so that it is able to make predictions based on that dataset.

The model itself, at the end of training, can be represented either as arrays in memory or as data on disk. You can think of it as a file: a package of information that contains a representation of how to make the predictions that we trained the model for. That's portable. You can take it from device to device and run it in different places.

The process of running the model is called inference: you take a thing the model hasn't seen before, run it through the model, and get a prediction. That's separate from training, which is where you take some labeled data and teach the model how to understand it. Those are the two main parts of machine learning: training and inference.

What I'm going to talk about today mostly falls into the bucket of inference, because inference is the thing that's most useful to do on edge devices. Training usually takes quite a lot of power, quite a lot of memory, and quite a lot of time. Those are generally three things that edge devices don't have. We're really mostly talking about inference here. There are some cool technologies for doing training on edge devices. We'll talk about those later if anyone has any questions.

What inference looks like in an application is this. First of all, we have our model and we load it into memory. We then take our input data and transform it into a form that fits the model. Every model has different input parameters. For example, the model we just trained would have had a fixed input size: it takes an image with a certain number of pixels. If we've got data from a camera that has a different resolution, we want to transform it so it fits the size the model expects. We generally have to do that. We then run inference, which is done by an interpreter, which takes the model, takes the data, runs the data through the model, and gives us the results. Then we figure out how to use the resulting output. Sometimes that's very easy; it's just some category scores in an array. Sometimes it's something a little bit more complicated, and we need to write some application code to make it clear what's going on.

Application Code

To show you all the parts of a typical ML application: first of all, we have our input data, which could be captured from a sensor or a device, or could just be data that exists in memory somewhere. We then do some pre-processing to get it ready to feed into the model. Every model has a different format that it expects; that's defined by whoever created the model. We then load the model and use an interpreter to run inference using it. Then we do some post-processing that interprets the model's output and helps us make sense of it in the context of our application. Then we can use that to do something cool for the user. TensorFlow Lite has tooling to do all of this. It has components that cover every aspect of this process, which you can use to easily build mobile and embedded applications that use machine learning.

In our exploration of TF Lite, we're going to go through an intro. We'll talk about how to get started with TensorFlow Lite, how to make the most of it once you start using it seriously. Then we're also going to talk about running TensorFlow Lite on microcontrollers or MCUs, which are the tiny devices that power all of our gadgetry that is smaller than a mobile phone or embedded Linux device.

Edge ML Explosion

The case for doing ML on edge devices is threefold. First of all, if you do ML on a device, you have lower latency. The original model of ML, from a couple of years ago, is that you have some big, crazy ML model running on a big powerful server somewhere. If you want to do machine learning inference, no matter what it's for, you send your data up to that server, and the server does some computation and sends you the result back. That results in pretty high latency. You're not going to be able to do nice, real-time video or audio stuff in that case. All your interactions are going to have some latency, and you're going to have to worry about things like bandwidth. Network connectivity is an issue if you're trying to do inference that is not on-device. This comes down to bandwidth and latency.

The other thing is, if we're able to do ML on a device, then none of the data needs to go to the cloud. That's much better for the user and it's much better for you as a developer, because you don't have to deal with the hairy issues surrounding user data.

If you can do away with some of these problems, you're able to build a whole generation of new products that weren't possible before: think of medical devices that can operate without ingesting loads of user data, or devices that are doing video modification in real time. This is an example of something you'd really struggle to do with server-side ML. On the device here, we have a model that's doing facial landmark detection. It can pick out where your eyes, ears, mouth, and nose are. That's allowing the app developer to add some animations and features onto a photo. If you tried to do this over a network connection, it would be really laggy and slow, and you wouldn't have that great an experience. Whereas here, running on a phone, it works really nicely.

Another example of this is pose estimation. This is a type of ML model that takes an image of a person as input. It's able to figure out what their different limbs and body parts are, and give you coordinates for those. In this case, the kid is able to get a reward in the app for having their dancing match up with the dancing in the little inset video. This is another thing where you need super low latency; it has to happen on-device. Also, if you have built a game or a toy, you probably don't want to be streaming loads of data. It's bad from a privacy perspective, and it also means you have to spend a lot on bandwidth. In this case, it's super suited to Edge ML.

Here's another use case. Imagine you're on vacation somewhere, or you're reading a book in a foreign language, and you want to look up definitions for words, and you maybe don't have good connectivity. This is a really good example of another place where an edge model makes sense, because you can do all this stuff on-device, which you wouldn't otherwise be able to do without an internet connection.

Thousands of Production Apps Use It Globally

There are thousands of apps using ML and using TensorFlow Lite for Edge ML at the moment. Google uses it across pretty much all of its experiences, and there are a bunch of big international companies that are also doing really cool stuff. There are 3 billion-plus mobile devices globally that are running TensorFlow Lite in production. This is a really good platform and a really good tool to learn if you're interested in doing this type of thing, because it's already out there. There are a bunch of guides and examples of how to use things. It's battle tested by some of the biggest companies.

TensorFlow Lite - Beyond Mobile Devices

Beyond mobile devices, TensorFlow Lite works in a bunch of different places. Android and iOS are obviously big targets. Another place is embedded Linux. If you're building stuff on things like Raspberry Pi and similar platforms, you can use TensorFlow Lite to run inference on-device, maybe in an industrial setting. We've also seen people doing things like wildlife monitoring stations that are set up in the jungle somewhere, or in places that are disconnected from a network but where you want to have some degree of intelligence. We also have support for hardware accelerators: a category of devices that are basically small embedded Linux boards with a chip on board dedicated to running ML inference really fast. There are chips from Google; we have this thing called the Edge TPU, which lets you run accelerated inference on these devices. NVIDIA has some similar products. There are a ton more on the way.

Our final target is microcontrollers. Microcontrollers are a little bit in a class of their own here because they have access to far fewer resources. In terms of memory and processing power, they're vastly smaller. They might have a couple hundred kilobytes of RAM and maybe a 48 MHz processor. They're designed for very low power consumption. We're able to run TensorFlow Lite on those. We can only run much smaller models, but you can still do some really exciting stuff. We're pretty much running the gamut from really powerful little supercomputers like mobile phones, all the way down to tiny little microcontrollers that cost a couple of cents each.

I want to talk a little bit more about on-device ML. The previous model is that we have an internet connection to a big, powerful server that's running ML. The device is located in an environment where it's collecting data, and in order to run inference on that data and do anything intelligent, it has to send that data back. With ML on the edge, the only part of this system that exists is the connection between the environment and the device. You don't have to worry about other connectivity. That helps you with bandwidth: you're not sending lots of data everywhere. It helps you with latency: for things like music or video, you can actually build applications with latency lower than humans are able to perceive. It's much better from a privacy and security perspective, because you're not sending people's data anywhere. It also removes a lot of complexity. You don't have to maintain back-end ML infrastructure that you might not have any experience with. Instead, you can just do everything on-device.

There are some challenges that come with this. One of them, which I've mentioned a few times, is that you might not have access to much compute power. The extreme case is these tiny little microcontrollers, with very little memory and processing ability. Even on a mobile phone, you have to think about how much computation you want to do in order to preserve battery life. You might not have a lot of memory no matter where this thing is running. Battery is always really important: whether it's on a smartwatch or an embedded device, you're always going to be thinking about power.

TensorFlow Lite is designed to make it easier to address some of these issues. The other big thing it allows you to do is take an existing ML model, convert it for use with TensorFlow Lite, and then deploy that same model to any platform. Whether you want to be running on iOS, or Android, or on an embedded Linux device, the same model will be supported in multiple places.

Getting Started with TensorFlow Lite

I want to talk a little bit about how to actually use TensorFlow Lite. I'll probably go through this at a fairly high level. There's a lot of documentation available. I just want to give you a taste of the level of complexity here for an application developer. The first thing I want to do is show you an example of TF Lite in action in a bigger experience. This year for Google I/O, we built an app called Dance Like. Basically, it's a fun experience built on TensorFlow Lite that uses a bunch of chained together ML models to help you learn to be a better dancer. I will show you a quick video about how it works.

Dance Like

Davis: Dance Like enables you to learn how to dance on a mobile phone.

McClanahan: TensorFlow can take our smartphone camera and turn it into a powerful tool for analyzing body pose.

Selle: We have a team at Google that had developed an advanced model for doing pose segmentation. We were able to take their implementation, convert it into TensorFlow Lite. Once we had it there, we could use it directly.

Agarwal: Running all the AI and machine learning models to detect body parts is a very computationally expensive process, where we need to use the on-device GPU. The TensorFlow library made it possible for us to leverage all these resources, the compute on the device, and give a great user experience.

Selle: Teaching people to dance is just the tip of the iceberg. Anything that involves movement would be a great candidate.

Davis: That means people who have skills can teach other people those skills. AOA is just this layer that really interfaces between the two things. When you empower people to teach people, I think that's really when you have something that is game-changing.

Situnayake: To give you a sense of what this looks like in action, the way it works is that you dance alongside a professional dancer who's dancing at full speed. You dance at half speed, then we use machine learning to take your video and speed it up and beat match you with the dancer's actual movements, and give you a score for how well you did.

We've made it Easy to Deploy ML on-device

The whole idea of TensorFlow Lite is to try and make it easier to deploy these types of applications. You can focus on building an amazing user experience without having to focus on all of the crazy detail of managing ML models and designing a runtime to run them.

There are four parts to TensorFlow Lite. One part is that we offer a whole bunch of models that you can pick up and use, or customize to your own ends. Some of the models we've seen already, like the pose detection model, are available. You can just grab them, drop them into your app, and start using them right away. We also let you convert models. You can take a model that you've found somewhere else online, or that your company's data science team has developed, or that you created personally in your own work with ML, and basically translate it into a form that works better on mobile. You then have the ability to take that file and deploy it to mobile devices. We have a bunch of different language bindings and support for a bunch of different types of devices. We also have tools for optimizing models, so you can actually do things to them that make them run faster and take up less space on-device.

Workflow

The workflow for doing this is pretty simple. First, you get a model. Second, you deploy it and run it on devices. There's not much more to it than that. I'm going to show how that works. Say you don't have a model; you don't even know what one really is. You can still get started really easily. On our site, we have an index of model types. You can go in and learn about each one, and learn about how it might solve the problems that you're trying to solve. We have example apps for each of those, so you can actually see them running in iOS and Android apps. That covers everything from image classification, where you're figuring out what's in an image, all the way through to text classification using BERT.

Here's an example of image segmentation. That's when you're able to separate the foreground and background of an image, or basically figure out which pixels in an image belong to which objects. In this case, we're figuring out which parts belong to a person. In the left-hand part, we're figuring out what the background is and blurring it so it looks like a pro photo. In the second one, we're letting the user replace the background entirely so they just look cool.

The second model is the PoseNet model for figuring out where your limbs are. You can use that data as a developer to do loads of stuff. Maybe you're drawing stuff onto the screen on top of people's bodies. You could also take this data as input to another Deep Learning network that you develop, one that is able to figure out what gestures people are doing, and maybe what dance they're doing, or what moves in a fighting game.

MobileBERT

A really cool thing that we just launched is MobileBERT. This is a really small version of BERT, which has almost as good accuracy and works on mobile devices. BERT is a cutting-edge model for various different classes of text understanding problems. In this case, we can put in a corpus of text. You can see there's a paragraph of text here about TensorFlow. The user can ask questions about it, and the model picks out parts of the text that answer the question. You can paste in any text you want. It could be FAQs for a product, a story, or a biography. The model is able to answer questions based on it. You can just weave that into your application for whatever you want to do with it.

Beyond the models that we're giving away, and we're actually adding more all the time (we have a whole team devoted to just doing that), we also support all different types of models that you can run through the TensorFlow Lite converter and use on your mobile device. These are some of the ones that we've identified as the most exciting from a mobile application developer's perspective. You can convert pretty much any model.

We'll talk a little bit about how that works. Imagine you've built a model with TensorFlow. If you've never used it before, basically, there are some high-level APIs that let you join layers of computation together to build a machine learning model, and then train it. It's actually pretty easy to use. You can get up and running really quickly, and there are some really good guides online. Once you've done that, you can just convert your model to run on mobile with a couple of lines of Python.
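For example, with a trained Keras model in hand, conversion really is only a couple of lines. Here is a minimal sketch; the tiny stand-in model and the output file name are placeholders.

```python
import tensorflow as tf

# Any trained Keras model will do; a trivial one stands in here.
model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(3,))])

# Convert the model into the TensorFlow Lite format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# The result is just a flat buffer of bytes you ship with your app.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```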

Once you've got your model ready, you want to run it. Running it is also super easy. In this example, first of all, we're loading a model file and instantiating an interpreter with that model. We're then processing our input to get it ready to feed into the model. Then we run the interpreter with that pre-processed input.
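The slide itself isn't reproduced in this transcript, but in the Python binding those steps look roughly like this (the Java, Swift, and other bindings follow the same shape). It assumes the model.tflite file written in the conversion sketch above.

```python
import numpy as np
import tensorflow as tf

# Load the model file and instantiate an interpreter with it.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Pre-process: shape the input to whatever the model expects.
# Zeros stand in here for real application data.
input_data = np.zeros(input_details[0]["shape"],
                      dtype=input_details[0]["dtype"])

# Run inference and read back the results.
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
print(output)
```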

The New TF Lite Support Library Makes Development Easier

We also have this support library, which can provide you with high-level APIs and, eventually, auto-generated code to pre-process your data and feed it into whatever type of model you want. You'll be able to find a model online and run it through the support library. The support library will generate some classes for you that you can drop into your application and that will do all of this pre-processing work for you. You can really just think of it as an API for getting the results of inference.

This is what the pre-processing code looks like without the support library, for transforming an image into the form the model needs in order to understand it. With the support library, it turns into a few lines of code. This is pretty awesome. It makes it a lot easier to use random models that you've found without having to deeply understand the input format that they require.
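The slides aren't reproduced here, and the talk showed the Android version of this code. As an illustration of the kind of manual work involved, here is a Python sketch of typical image pre-processing; the [-1, 1] normalization is an assumption, since the correct range depends on the particular model.

```python
import numpy as np
from PIL import Image

def preprocess(path, input_shape):
    """Manually massage an image file into a model's expected input tensor.

    input_shape is e.g. (1, 224, 224, 3), as reported by
    interpreter.get_input_details().
    """
    _, height, width, _ = input_shape
    img = Image.open(path).convert("RGB").resize((width, height))
    x = np.asarray(img, dtype=np.float32)
    x = (x - 127.5) / 127.5           # assumed normalization to [-1, 1]
    return np.expand_dims(x, axis=0)  # add the batch dimension
```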

We talked about the converter and the interpreter, and these make use of another couple of high-level things called Op Kernels and Delegates. We'll talk about those a little bit more later.

Language Bindings

We have language bindings for a ton of different targets. You might be working on iOS, or Android, or on embedded Linux; we've got you covered. We have Swift, Objective-C, C, C#, Rust, Go, and Flutter. You basically have libraries supported either by us or by the community for pretty much anything you can think of.

Running TensorFlow Lite on Microcontrollers

I want to also talk about microcontrollers, but a little bit separately because we have two interpreters. There's an interpreter that runs on mobile devices. Then there's a super efficient, super handcrafted interpreter that runs on microcontrollers, because they need such efficient code.

Microcontrollers are these tiny computers on a single piece of silicon. They don't have an OS, they have very little RAM, and they have very little code space for storing your program. You can't put really big models on there, and you can't do loads of computationally intensive stuff quickly. They are built into everything, and they're really cheap. There are actually, I think, 3 billion microcontrollers produced every year in all types of devices. By being able to add Deep Learning-based intelligence to all of these things, we're talking about microwave ovens, kitchen appliances, components inside of vehicles, and even smart sensors that can take an arbitrary input and give you a very simple output that you can then build into other products.

An example of how you might use microcontrollers for inference: maybe you've got a product that is figuring out what a person is saying. Imagine you're building a smart home device that can understand speech, and you want it to use as little power as possible. You might have a Deep Learning network running on a microcontroller that, first of all, figures out if there's any sound that seems worth listening to. When that sound happens, the output of that model is used to wake up a secondary model, which is looking to figure out whether the sound is human speech or not. That's something that would be difficult to do without Deep Learning. Once you've figured out that, yes, this is human speech we're hearing, you can wake up the application processor, which has a deeper network that does the actual speech recognition. By cascading models in this manner, we're able to save energy by not waking up the application processor for every little noise that happens. This is a really common use case for this type of technology. With TensorFlow Lite for Microcontrollers, you use the same model format, but there's a different interpreter, and that interpreter is optimized very heavily for these tiny devices.
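As an illustration only: real microcontroller code would be C++ using TensorFlow Lite for Microcontrollers, but the cascade logic described above amounts to something like this sketch, where every function name is hypothetical.

```python
def run_audio_pipeline(audio_frames, sound_detector, speech_detector,
                       wake_application_processor):
    """Cascade of increasingly expensive models; all names are hypothetical."""
    for frame in audio_frames:
        # Stage 1: tiny, always-on model - is there any sound worth listening to?
        if not sound_detector(frame):
            continue
        # Stage 2: slightly larger model - does it sound like human speech?
        if not speech_detector(frame):
            continue
        # Stage 3: only now wake the power-hungry application processor,
        # which runs the full speech recognition model.
        wake_application_processor(frame)


# Toy usage with stand-in detectors.
frames = ["silence", "dog bark", "hello there"]
run_audio_pipeline(
    frames,
    sound_detector=lambda f: f != "silence",
    speech_detector=lambda f: "hello" in f,
    wake_application_processor=lambda f: print("waking app processor for:", f),
)
```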

This is a tiny little microcontroller using very little power. The whole cost of the microcontroller itself would be a couple of dollars, and the camera is also very low power and cheap. It's able to detect whether a person is in the frame or not. The way we've got this set up, there's a display, but this can actually be boiled down into a tiny device the size of your fingernail, which you could put in any product. It gives you a Boolean output: if there's a person visible near the device, it gives you a 1, and if there's no person visible, it gives you a 0. It uses barely any power. You don't have to know anything about machine learning to be able to use this thing; you just put it in your hardware product. You can have a TV that automatically shuts off when no one's watching it, for a couple of extra dollars of manufacturing cost. These smart sensors are going to absolutely transform the world around us. This type of technology has only existed for a matter of months; TensorFlow Lite for Microcontrollers was announced in February of this year. We've not even remotely started to see the applications people are developing. If you have any interest in embedded development, you should definitely start playing with this stuff, because it's really fun and surprisingly easy.

Here is another video from our partners at Arduino. We're able to run TensorFlow Lite for Microcontrollers on the most recent Arduino devices. They've got some tutorials for doing all sorts of cool stuff, like recognizing gestures or recognizing objects using the sensors on-device. You can actually just grab the examples for Arduino from within the Arduino IDE, because we've published a library that's really easy to use.

On an MCU, we can do everything from speech recognition through to interpreting camera data, and we can do gesture recognition using accelerometers. We can do predictive maintenance, where you're looking at the vibration of an industrial component to figure out when it's going to break so that your whole factory doesn't explode. This is all really exciting, because you can push intelligence down close to these sensors and do this type of inference really cheaply.

We have an example for speech recognition: a 20 KB model that can discern between the words "yes" and "no." We also have scripts you can use to retrain it for other words. This is really exciting to just play with.

Person detection is my favorite, really, because it's just so mind-blowing. You have a tiny little camera, and the model is 250 KB. It won't fit on every embedded device, but it will fit on some tiny devices. You can run scripts to retrain it easily to recognize other objects. If you want to build a smart sensor for your bicycle that will tell you when there's a car coming up close behind you, you can do that super easily.

We have an example where you have a device with an accelerometer, and you can use it as a magic wand. You can do different gestures and cast different spells. We built a game that lets you do that. Obviously, there are some practical applications for this too, in activity trackers. The model for that is also really small, 20 KB, and it's trained with data captured from 5 people. It probably took an hour to capture all the training data. It's super easy to build something powerful.

Improving your Model Performance

Beyond microcontrollers, and across all of this Edge ML stuff, you have to think about how to make models that perform well on small devices. We have all the tooling to do that, too. The big thing for TensorFlow Lite is performance across different types of devices. We've got really good performance across a bunch of different types of accelerators. If you're just running on CPU, it's pretty quick. If your device, like most mobile phones, has access to a GPU, you can run models super fast, because the calculations they involve are highly parallelizable. If you have a hardware accelerator like the Edge TPU, you can run inference ridiculously fast.

There are also a bunch of techniques you can use to improve the performance of your model, and the TensorFlow Lite converter can help with all of these. It can do things that make your model smaller and make it run better on different types of devices. If you're on CPU, you can do something called quantization, which basically involves reducing the precision of the numbers in the model while taking that into account during inference. By default, the numbers in a TensorFlow model are represented as 32-bit floating-point values. If we reduce them down to 8-bit integers, we can make the model a quarter of the size while keeping most of the same accuracy.
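Post-training quantization is exposed through the same converter. Here is a hedged sketch, in which the stand-in model and the random representative dataset are placeholders for your real model and real sample inputs.

```python
import numpy as np
import tensorflow as tf

# Stand-in model; in practice this is your trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, input_shape=(32, 32, 3), activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_data_gen():
    # Yield a few samples shaped like real model inputs so the converter
    # can calibrate the quantization ranges; random data stands in here.
    for _ in range(100):
        yield [np.random.rand(1, 32, 32, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
quantized_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(quantized_model)
```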

Pruning

Pruning is another really cool technique. The model is basically a network of neurons, and some of the connections between the neurons are very important while others are not so important. If you figure out which ones are not that important, cut them, and just ignore them, you don't have to represent that data anywhere and you don't have to do that computation. We have tooling that lets you do that, so the model can basically run more efficiently without any reduction in accuracy.
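The tooling referred to here lives in the TensorFlow Model Optimization Toolkit rather than in TensorFlow Lite itself. Below is a rough sketch of wrapping a Keras model for magnitude-based pruning; the model and the schedule parameters are illustrative, not recommendations.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in model; in practice this is the model you want to shrink.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# During training, the least important connections are gradually zeroed
# out until 80% of the weights are gone.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8,
    begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# Training needs the UpdatePruningStep callback to advance the schedule:
# pruned_model.fit(x, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```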

There's also a bunch of really low-level stuff you can get into the weeds with to figure out how to do this more efficiently. I won't go into a lot of detail here. We have mechanisms for making use of the types of accelerators that are in all kinds of devices, from mobile phones through to these specialized accelerators. GPU delegation is one of those: you can run a model on the GPU of your device. You can also make use of DSPs, which are purpose-built chips inside a lot of devices that allow you to do these kinds of calculations really fast. We make use of Android's NNAPI, and also Metal on iOS. It's really easy to do this. You can add an option to your interpreter to tell it to use acceleration.
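How you request acceleration depends on the binding. On Android it's an Interpreter option that adds a delegate; in the Python binding, a sketch for loading an external delegate such as the Coral Edge TPU looks like this. The library name and model file are assumptions about a particular setup, not a general requirement.

```python
import tensorflow as tf

# Load a delegate that hands supported ops off to an accelerator.
# "libedgetpu.so.1" is the Coral Edge TPU runtime library; swap in
# whatever delegate library your hardware vendor provides.
delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")

interpreter = tf.lite.Interpreter(
    model_path="model_edgetpu.tflite",   # placeholder: model compiled for the accelerator
    experimental_delegates=[delegate])
interpreter.allocate_tensors()
```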

Here's an example of how you can optimize models for different types of uses. Inception by itself is a 95 MB model for image classification. Google developed MobileNet, which is a model that does exactly the same thing with almost as good accuracy, but much faster. When you're thinking about deploying models to your device, there are often mobile-optimized versions of popular models. You should look for those and use them if available.

We have tools for profiling how long it takes to do various stuff. You can even do that down to the operator level. ML models are built out of these different operations and you can identify which ones are taking the longest. If you're trying to run a model on a device, you can figure out which parts of the model are not running fast, and you can work with your ML engineers to optimize that.

There are also ways to use models that are not fully supported by TF Lite. TensorFlow Lite has a few hundred ops; TensorFlow has 1,000 or so ops that you can use. You can import those ops from TensorFlow to use in your mobile applications. You can also selectively build the TensorFlow Lite runtime so it only includes the ops that your model uses. This is another way to get your binary size as small as possible. You can set some things up in your code, and then use tools like the Android build tooling to tell it to only grab what it needs.
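The first part of that, falling back to TensorFlow ops that TF Lite doesn't implement natively, is a converter setting; the selective build of the runtime itself happens in your app's build configuration and isn't shown here. A minimal sketch, with a trivial stand-in model:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(3,))])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Allow full TensorFlow ops for anything the built-in TF Lite op set
# doesn't cover; the app then needs the "select TF ops" dependency at runtime.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
```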

How to get started

Hopefully, I've covered some very high-level ML stuff. I've talked about on-device ML, why it should exist, and how it's going to make a big difference in the future. I've also given you an idea of what you'll start to think about when you begin working on this stuff as an engineer. Let me point to some resources that you can use to get started.

We actually just launched a course with Udacity that covers TensorFlow Lite, end-to-end. If you're interested in getting started with TF Lite, definitely search for this, check it out. We cover inference on Android, iOS, and Raspberry Pi. If one of those platforms interests you, you can ignore the other stuff and you can pick and choose what you want to learn.

If you're interested in the microcontroller side, I've just co-authored this book with Pete Warden, who is the guy on our team who pretty much helped invent this space. This book is going to be available in mid-December. If you are an O'Reilly subscriber, you can read the early release version already. We basically give you an introduction to how embedded ML works and how you can use TensorFlow Lite to work with it. It's written so that even if you're not an embedded developer, or you're not an ML developer, you can still build all the projects in the book.

If you're especially interested in this embedded stuff, we run monthly meetups on embedded ML. There are two right now: one is in Santa Clara, and one is in Austin. We are actually launching more all the time; every couple of weeks we get an inbound request from someone who's interested in starting up a group. There are going to be meetups like these all over the world. If there isn't one in your local community now, there will be soon. It's really cool to go and meet people, and see presentations from people who are doing cool stuff in this space.

The main place to go for info on TensorFlow Lite and all the stuff we talked about is our TensorFlow Lite doc site. We've got information on everything that I've talked about today. We're actually going to be revamping the docs over the next month or so, so that they cover even more of the information that you need.

 


 

Recorded at:

Apr 28, 2020
