
Swift for TensorFlow



Paige Bailey demonstrates how Swift for TensorFlow can make advanced machine learning research easier and faster.


Paige Bailey is the product manager for Swift for TensorFlow at Google. Prior to her role as a PM in Google Brain, she was developer advocate for TensorFlow core; a senior software engineer and machine learning engineer in the office of the Microsoft Azure CTO; and a data scientist at Chevron.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Bailey: Thank you for coming here today to learn a little bit more about Swift for TensorFlow. My name is Paige Bailey. I've been doing machine learning for a little bit over a decade now, though I started with traditional machine learning before going into deep learning specifically. When I started doing it, it wasn't called data science or machine learning at all, it was called research. You just had massive amounts of data in which you were looking for statistically relevant samples, and then you tried to use that data to understand systems and to make predictions about their behavior in the future.

I worked at NASA for a couple of summers doing research there on lunar ultraviolet. I'm currently at Google working with Chris Lattner, who's responsible for the Swift programming language, LLVM, and a whole bunch of work on Clang. My background is not machine learning at all, it's geophysics and applied math. I started off doing work in the oil industry until I realized that people would also pay me money to program computers, and that was a bit more delightful. I'd also had no experience with Swift before working on this project. It's been nice to be able to transition back to the structure of a typed language as opposed to working with Python, though Python does have a number of benefits, which we'll touch on in this presentation as well.

As an outline for today, we're going to talk a little bit about what Swift for TensorFlow is and what it is not. It is not just another language binding for TensorFlow, it is actually a reimagining of the complete framework. That gives us a number of benefits, particularly in production settings, and particularly in terms of performance. You might be interested to learn a little bit more there. We're also going to talk about why Swift, in particular, is useful for deep learning and give you some resources on how you can get started if you want to begin programming with Swift, especially in a machine learning context.

Machine learning and deep learning, in particular, have exploded in popularity over the last several years. We've seen models starting to be integrated into every possible deployment target, so in browsers or in mobile devices, and this is obviously true in computer vision tasks. I think everybody's familiar with "Hotdog, Not Hotdog" situations, but it's also true in really interesting language tasks. BERT is a great example of a model that's been very popular in academic settings. It was originally released with TensorFlow on TPUs, and then almost immediately had another implementation running on TPUs.

It's now being integrated into pretty much every product that we have at Google, as well as a number of products at Microsoft and other technology companies. It's great in that all of these models and all of these frameworks are open-source, so folks can use them to meet the needs of their own businesses without necessarily having to reinvent the wheel. Instead of having to hunt down the massive amounts of data that would be needed to train at Google scale, you're able to take the model that has already been trained and minutely tweak it to meet your use case.

We've also seen an explosion in terms of academic contexts. The growth in the number of machine learning papers produced every day, as of 2017, has outpaced Moore's law. If I extended this timeline out to 2018 and 2019, it would be even more impressive. The publication count for conferences is getting to be a little bit crazy. There's a conference each December called NeurIPS that sold out its tickets in something like seven minutes last year. You would think that we had Beyonce coming to sing or something. It was really dear to get a ticket.

This year, it was so popular that every person who wanted to register for the conference had to be put into a drawing pool to be considered for the privilege of being able to pay money to go to this conference. It's been awesome to see the interest and to see people get more and more enthusiastic about building and deploying machine learning models. Also, as a person who's been doing this for a little bit over 10 years, it's a bit overwhelming in the sense that it did not use to be cool, and now it is, and most of us are still trying to figure out how to cope.

We've also seen really impressive accuracy improvements over the years. These improvements are brought on not by those minute tweaks that I was talking about before, where somebody releases BERT and you change the hyperparameters, or you have some novel way to understand embeddings or something. It's mostly through step changes that come from these algorithmic fusion techniques: you have some neural network that you marry with something like Monte Carlo Tree Search, you put these two together, and then suddenly, voila, you have a 5% performance improvement.

This is great in the sense that you keep seeing these really interesting blends of more known and more proven technologies with this new stuff. It also means that you need to have a language that's capable of marrying those techniques, even though a framework hasn't really been solidified yet. For example, a number of the high-level APIs that are used for deep learning projects are there for very well-defined, out-of-the-box tasks, but they aren't really good at being stitched together, composed, and integrated effectively.

Google is also an AI-first company. As we build out these models in Google Brain, we want to be able to deploy them as quickly as possible. That's not just for browser-based applications like Gmail or YouTube, but also for things on mobile devices. If any of you have a Pixel device, you might have noticed just recently that speech to text, being able to speak into your phone and have it immediately transcribed in English, has gotten much faster. That's because we've been able to take models, quantize them, and then place them directly on mobile devices with superfast inference speed.
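To make the quantization step concrete, here is a minimal Python sketch. It assumes a simple symmetric int8 scheme with a single scale per weight list; the function names are hypothetical, and this is not how TensorFlow Lite or any Google product actually implements it.

```python
def quantize_int8(weights):
    """Map floats to int8 values using one symmetric scale for the whole list."""
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.2, 0.0]
q, scale = quantize_int8(weights)        # q == [51, -127, 25, 0]
restored = dequantize_int8(q, scale)     # close to the originals, not identical
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half a quantization step.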

TensorFlow is running on every single one of your Pixel devices as well as every single Android device, which is amazing. The crux is that models are getting more and more ubiquitous, the deployment scenarios are getting more and more novel and complex, and performance is becoming increasingly important to the people who are building applications as well as the people who are using them.

We've also got this cool stuff for accelerated hardware in terms of training. What you see on the screen here is a collection of TPUs that we've produced over the years at Google, as well as a TPU pod scenario. These are really powerful, giving us the ability to take a model that would have historically taken weeks to train and train it in a matter of minutes or hours. Whenever you have this as a deep learning engineer or machine learning engineer, you suddenly unlock all of these new possibilities. Instead of having a rapid prototyping cycle that isn't super rapid, like you have an idea and then three weeks later, you figure out if it worked or not, or more frustratingly, you have an idea, you attempt to prototype it, you make some sort of code mistake, and you only find out three weeks later and have to restart all over again, you're suddenly getting a much, much speedier feedback loop. You can try something, go and get a coffee, figure out if it worked, try something else, debug it a little bit, and then get a new scenario.

Given all of this, it is incredibly important that whatever language you use can communicate directly with TPUs. TPUs use a special numeric format called bfloat16. Being able to work with that effectively and being able to directly interact with the architecture is increasingly important.
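For readers unfamiliar with bfloat16: it keeps the sign bit and the 8 exponent bits of a 32-bit float but only 7 mantissa bits, so it is effectively the top half of a float32. The Python sketch below illustrates that layout. The function names are hypothetical, and it truncates rather than rounds, which is a simplification; real hardware typically rounds to nearest.

```python
import struct


def float_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by keeping the top half of its float32 bits."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # simple truncation, no rounding


def bfloat16_bits_to_float(bits16):
    """Expand a 16-bit bfloat16 pattern back into a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


one = float_to_bfloat16_bits(1.0)                                  # 0x3F80
approx = bfloat16_bits_to_float(float_to_bfloat16_bits(3.14159))   # 3.140625
```

The range matches float32 because the exponent is untouched; only precision is given up, which is why 3.14159 comes back as 3.140625.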

Swift for TensorFlow

Here, I get to the point. Why Swift for TensorFlow? Given all of these constraints and frustrations that I've mentioned, how is Swift uniquely positioned to help machine learning engineers and their production teams be effective at deploying their models?

For this, I like to give an anecdote. This is from a book called "Python Interviews." It details the story of a startup. Google, in its early days, was continuously frustrated that this startup was able to have an idea and really rapidly prototype a solution and deploy it before Google was even able to get to the first step using C++. They kept getting beaten to market over and over again. It wasn't until they acquired this company, called YouTube, and understood their code base, that they realized that the reason this company was able to iterate so quickly and have a fast path to deployment was that they were using Python. They weren't using C++.

It was just that simple language choice that made them so effective and made the deployment speed so rapid. This is great if you're operating at a high level. What we see increasingly with these algorithmic fusion techniques is that people want to experiment with distributed training. With reinforcement learning, it's a given that you need to be able to have low-level control over the distribution techniques for your hardware. You can't do that really effectively at a high level, you have to get a little bit lower. If you get lower, you're probably using C++, and C++ isn't friendly. It is not a fun language to use. It has a number of affordances, which are really powerful. It's performant, you have types. Python has introduced types as part of Python 3, but it's still a little bit frustrating. Regardless, if you want to experiment at these lower levels, you need something that's just as performant, just as powerful, and just as interoperable as C++. Historically, there hasn't been an option until Swift. If you're wanting to try these more novel deep learning step-change improvement techniques, Swift is a great tool for you to explore.

What does it look like, you may ask. You're telling me it's just as speedy as C++. You're telling me that it has C++ interop, or C interop, or Python interop? What does it actually look like? The good news is, it looks a lot like Python. Coming from the Python world myself, I can tell you that you might see a few more "lets" sprinkled in. You might see some curly braces. You might see some indication of types. Other than that, if you squint a little bit, it looks like Python, which is really exciting. It's very readable. It's open-sourced, even though the language was started at Apple. It's community-driven.

It gives you the ability to prototype performant, production-ready solutions very quickly. If you want to implement a model, it looks a little bit like this. For an image classification model, you would import Swift for TensorFlow and create a model with a few layers. You have here a 2D convolutional layer, max pooling, and some additional features. You would preface the function with @differentiable. With Swift for TensorFlow, any function is differentiable, regardless of whether it's related to machine learning or not. You could create any arbitrary function and be able to understand the change over time. Then you would set up this TensorFlow context, and you're off to the races.
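As a rough illustration of what it means to differentiate an arbitrary function, here is a small Python sketch using dual numbers (forward-mode automatic differentiation by operator overloading). This is a swapped-in technique chosen for brevity: it is not how the Swift compiler implements @differentiable, and the class and function names are hypothetical.

```python
class Dual:
    """A number that carries its value and its derivative with respect to one input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants, whose derivative is zero.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f at x and return df/dx, seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    # An ordinary function with nothing machine-learning-specific about it.
    return 3 * x * x + 2 * x + 1


slope = derivative(f, 2.0)  # f'(x) = 6x + 2, so 14.0 at x = 2
```

The function `f` is written with no knowledge of differentiation; the derivative falls out of how the arithmetic is evaluated.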

It's actually really interesting to explore training a model. You have an optimizer, you have a couple of distributions, and then you just have a for loop where you can apply your gradients and your softmaxCrossEntropy. If you look at the difference between the Swift implementation of a model and the Python implementation, on the bottom, you have something from TensorFlow historically. You have probably seen this canonical MNIST example from tf.keras. It looks almost the same.
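The overall shape of such a loop (compute a softmax cross-entropy loss, compute gradients, apply them) can be sketched in plain Python. This is a toy under stated assumptions: the logits themselves stand in for the trainable parameters, the learning rate is arbitrary, and none of this is the Swift for TensorFlow or tf.keras API.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


def gradient(logits, label):
    """Gradient of the loss with respect to the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


logits = [0.0, 0.0, 0.0]   # three classes, no preference yet
label = 2                  # the correct class
learning_rate = 0.5
losses = []
for step in range(20):
    losses.append(softmax_cross_entropy(logits, label))
    grads = gradient(logits, label)
    # Plain gradient descent update, the role an optimizer plays in a framework.
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]
```

The first recorded loss is ln 3, the loss of a uniform guess over three classes, and it shrinks as the loop pushes probability toward the correct class.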

It's actually funny. The Swift implementation has one line fewer than the Python implementation, at least in this example. This slide is taken from a course that was delivered by Jeremy Howard and Chris Lattner just recently. Jeremy Howard is one of the creators of fastai, with Rachel Thomas. They've done great work building out really understandable high-level APIs in Python. The next iteration of Jeremy's course is taught in Swift, and he's actually re-implemented all of fastai in Swift under a new name, which is delightful and also very punny, which I appreciate.

Implementing a model in Swift and implementing a model in Python look pretty much the same, but there are dramatic differences in performance and deployment considerations.

Another nice thing about Swift: what you see here is an example of a Swift function, alongside the same code in assembly. You see here that it's very succinct, very clean. Whereas if you were implementing similar functionality in another language, it would certainly not fit on a single slide. It's great to be able to understand precisely what is happening whenever you use Swift to implement some function. Swift often gets the criticism of being syntactic sugar for LLVM. This is a great example of how compiler technology can really build powerful and performant code.

Why Swift for Machine Learning?

Why Swift for machine learning? I've already explained a little bit about this. In case you weren't aware, Swift is cross-platform. This surprised me when I first started. You hear Swift, and you're, "Yes, Apple thing." It goes on the things that start with the "i." It goes on the iPhones, and the iPads, and the such, and the MacOSs and whatever. Actually, it can go anywhere C++ can go. That means Linux, macOS, Windows, we do have support on all of those platforms. It also means Android and iOS devices, as well as embedded devices. There was recently a master's thesis proving that you could get the Swift binary size down low enough to go on M-series embedded devices, which is really cool, and also has a number of applications when you start thinking about sensor technologies being enabled by deep learning models in things like factories or cars.

It's also very syntactically similar to Kotlin. If you're an Android developer, and your alternative is, do you want to have a model that you deploy in your app and it's implemented in Python? Or, do you want to have something that looks a little bit similar to the language that you're already using, then it's a little bit niftier. The other great thing is that if you have a model that's implemented in Swift, it compiles down to a .so file. You can import that into your Android app, and you're off to the races again. You'd use it just like any other library. That's pretty cool.

It also has a focus on productivity and customizability. Typed APIs, static detection of errors, you get all of the nice developer tooling that you would see in the C++ world. Semantic-aware autocomplete is always a frustration for Python. We have tooling at Google that allows you to follow and trace back files as you're programming. It always box it, especially at the files, but being able to follow things through is quite important to us. Then, also customizable abstractions in "user space." All of these things combined make Swift really nice for developer tooling.

There's also another project that was recently started by Chris called MLIR. MLIR is an intermediate representation layer. ML does not stand for machine learning in this context; it stands for multi-level intermediate representation. It's built on top of LLVM. The idea for MLIR is that historically, we've seen a number of issues with deploying, especially to mobile targets, but also just in general. You might have a model that's been implemented in Python, and some of the ops are supported on newer devices. If you were using devices from circa 2010, many of those ops might not be supported. You deploy your model, and it either does not work as expected, or it fails silently, or it outputs problematic results.

This isn't an artifact of the model not being created effectively, it's an output of the device, the deployment target that you're putting it on not being able to support some of that functionality. This is ubiquitous across all kinds of computing. Not just machine learning, but everything. We also have scenarios with TPUs where you have to have software that's specifically architected to meet that hardware consideration. Right now, the frustration for people who want to code for TPUs is that you have to hand-code everything to work effectively on those deployments.

MLIR is building out this mid-level representation that allows you to build a dialect in which you can specify that this op can run on this hardware architecture. If it can, it will. If it can't, it gets kicked back to CPU. Why is this important? It means I could write a model using something called scikit-learn, which is a traditional machine learning framework implemented in Python, compile it down, and run that model anywhere. It could be deployed to a mobile device, and if there's an op in scikit-learn that could run on TPU, it will. If it can't, it just gets punted back, but regardless still continues to run. Or, I could take that same model and put it on a TPU. I can take that same model, put it on a cluster of CPUs, and the distribution would be taken care of for me. It's really trying to abstract away all of the pain of deployment from the people who are actually building software and trying to make it work in real life. That's MLIR.
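The "run it on the accelerator if you can, otherwise fall back to CPU" idea can be pictured with a small Python sketch. Everything here is hypothetical: the table of supported ops, the backend names, and the `place_ops` function are invented for illustration and have nothing to do with MLIR's actual dialects or APIs.

```python
# Hypothetical table of which ops each backend supports (illustrative only).
SUPPORTED_OPS = {
    "tpu": {"matmul", "relu"},
    "cpu": {"matmul", "relu", "sort"},
}


def place_ops(ops, preferred="tpu", fallback="cpu"):
    """Assign each op to the preferred backend when supported, else to the fallback."""
    placement = []
    for op in ops:
        if op in SUPPORTED_OPS[preferred]:
            placement.append((op, preferred))
        elif op in SUPPORTED_OPS[fallback]:
            placement.append((op, fallback))
        else:
            # Fail loudly instead of silently producing wrong results.
            raise ValueError(f"no backend supports op {op!r}")
    return placement


plan = place_ops(["matmul", "sort", "relu"])
# plan == [("matmul", "tpu"), ("sort", "cpu"), ("relu", "tpu")]
```

The model keeps running end to end; only the unsupported op is punted back to the CPU.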

I strongly suggest, if you're interested in that topic, taking a look at the MLIR special interest group. They have open design meetings every Thursday, and we have open design meetings for Swift every Friday. The idea is that, just as Swift is syntactic sugar for LLVM, it's intended to be the same syntactic sugar for MLIR going forward. It's also really exciting in that, according to this keynote blog post that I've linked here, 95% of the world's hardware manufacturers for data centers have signed on to support MLIR. That's the Xilinxes, the Qualcomms, the Nvidias, the Intels, those guys. They're all on board, which is exciting to see. They're contributing code, which is also really cool.

Swift also has great interop with no wrappers. You can just import any library from C and immediately use it. This is partially because of the great work from the Apple team, too, because they needed Swift to be able to be their replacement for Objective-C internally. At Google, we've just implemented C++ interop so you can import any arbitrary C++ header, use it and extend it directly from Swift and still see the same performance benefits that you would see from C++, which is magic.

It also means that if you have an existing code base filled with C++ or C, instead of having to rewrite everything in order to use this new language, you can just have an incremental introduction of Swift as a technology. This is incredibly important in the sense that a lot of great code has been written, and it's really silly to forego all of it just because of changing times.

We also have Python interop. A question I always get is, "As a data scientist I'm very familiar with using NumPy. I love matplotlib, all of this great ecosystem of tooling that's been built around data science and machine learning. Can I keep using it?" With Swift, you just import Python, and you can use any library exactly the same as you would from your favorite interpreter. You can use NumPy. You can plot things out with matplotlib, and it feels very natural. Of course, you're limited in that Python is single-threaded, so you're at the mercy of the GIL. Other than that, it's quite nice.

We also give you the ability to create custom kernels. It's infinitely hackable in terms of developing custom kernels. This is an example of 1D average pooling. Also, one of the prototypes that I was most excited about recently was giving users the ability to create custom CUDA kernels directly from inside a Jupyter Notebook in Swift. If you've ever tried to do CUDA programming, historically, it is not fun. That made it look easy. Having these new and novel techniques, we really think, will enable some of those step-change improvements in deep learning models that we were mentioning before.

We've also integrated differentiable programming directly into the language and are in the process of upstreaming it back to Swift core. This gives you custom derivatives, user-defined types, and it's flexible whenever you need it. Any type in Swift is customizable, which means instead of having int defined by a standard library, you could define it yourself and also methods that could extend it. I mentioned before that TPUs require something very special called bfloat16. This is great for them in the sense that it no longer becomes a pain to deal with in a programming language. You can just create that custom type and use it as you desire.

We also have language-integrated autodiff. That's the @differentiable attribute that I was showing before, which you can use to preface any function, and which, as I mentioned, is in the process of being upstreamed to Swift core.

Performance is always fun. Swift has very speedy low-level performance, often just as fast as you would get with C, which is crazy, but also awesome. We have thread-level scalability without the GIL, automatic graph execution, and performance parity with C and C++. These are all in support of those algorithmic fusion techniques that I was mentioning before.

Recently, we delivered a series of workshops with Jeremy Howard and Rachel Thomas, you can see them on the screen there, specifically focused on why Swift is a game-changer for machine learning, and what the timeline was for implementation. What you can see is, if you watch these very in-depth workshops, is that Jeremy, who is very much a Python person who famously transitioned from using TensorFlow core to using PyTorch particularly because of speed, has since migrated back to TensorFlow, specifically for Swift for TensorFlow, because the performance that he required from PyTorch wasn't there. He's been really excited to try and build out a library using Swift that would be just as understandable and fun to use as this fastai implementation in Python.

We've also worked at DeepMind with the AlphaZero team and the AlphaGo Team. The thing that they're most famous for, I believe, is creating an implementation of Go that could beat a human. Something that not a lot of folks know is, even though it's a deep learning model, that model was not developed in Python. In order to get the flexibility that they needed, they needed to implement the entire thing using C++, which, again, is very painful and did not really lead to rapid prototyping as quickly as they had hoped.

We were able to work with this team at DeepMind and reimplement AlphaZero to be just as performant as C++, in a fraction of the number of lines of code, over the course of about a week when we were there for their engineering extravaganza. That was very cool to see, and it was really fun to work with the team.

That's just an example of the combination of three technologies that were required to implement AlphaGo Zero. Deep learning, Monte Carlo Tree Search, that algorithmic fusion that I was talking about before and high-performance TPUs. If you mix all of those things together, it plays a lot nicer with a typed language than it would with Python.

We've also worked in a reinforcement learning context to implement OpenSpiel, which is games with imperfect information. You might recognize Poker, Backgammon, a couple of others. Many of those algorithms are implemented in Swift. If you would like to try them out today, you absolutely can. The paper is being presented in Europe later this year, I believe, as well as the workshop.

We've also been exploring some ideas about deployment to mobile and embedded devices. This is still very early-stage work. You can imagine that it would be very interesting to deploy machine learning models and do live training on device, which would not be afforded with something like Python. What do I mean by this? In every Pixel device, there's a model for adaptive brightness. That might be trained on a massive amount of data on servers somewhere and then deployed to device, but then it fine-tunes preferences based on how I interact with my cell phone. It might notice that whenever I pull up my ebook reader at 7:00 p.m. when it's dark outside, I might turn down the brightness. Or, it might notice that whenever I pull up Spotify as I'm walking along down the street, I might turn the brightness up a little bit if the ambient brightness is at a specific level.

It learns those behaviors over time based on the color of the app, the brightness in the room, the time of day whether I'm at my house versus elsewhere, those sorts of things. All of those personal customizations go into creating a hyper-personalized model just for me as opposed to the model that was implemented on tons and tons of user data somewhere in a server.
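A minimal Python sketch of this kind of on-device personalization follows. It assumes a frozen, server-trained model plus a single per-user offset that is nudged each time the user overrides the suggestion. The model formula, class name, and learning rate are all hypothetical; the actual adaptive brightness feature is not described in this talk at that level of detail.

```python
def global_brightness(ambient_light):
    """Stand-in for a frozen server-trained model: ambient light in [0, 1] to brightness."""
    return 0.2 + 0.6 * ambient_light


class PersonalizedBrightness:
    """Adds a per-user offset, learned on device, on top of the global model."""

    def __init__(self, learning_rate=0.5):
        self.offset = 0.0
        self.learning_rate = learning_rate

    def predict(self, ambient_light):
        raw = global_brightness(ambient_light) + self.offset
        return min(1.0, max(0.0, raw))  # keep brightness in the valid range

    def observe(self, ambient_light, chosen_brightness):
        """Nudge the offset toward what the user actually picked."""
        error = chosen_brightness - self.predict(ambient_light)
        self.offset += self.learning_rate * error


model = PersonalizedBrightness()
for _ in range(10):
    # The user keeps dimming the screen to 0.1 in a dark room.
    model.observe(ambient_light=0.1, chosen_brightness=0.1)
```

After a handful of corrections the prediction for a dark room sits close to the user's preferred 0.1, while the shared global model is left untouched.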

That is one example. Nest temperature control is another one. It might notice that whenever you get to your house, you turn down the temperature or you turn it up, and it's usually at a certain time given the day of the week, and you could automatically start anticipating those needs of your customers. The list goes on. Hearing aids: people have hearing aids that are often paired to mobile devices, and being able to make those customizations immediately instead of having a person do it is really powerful. So is being able to recommend personalized insulin levels if you're a diabetic, if you have a pump.

This is really exciting to us, and we're exploring this actively. If you have ideas, we would love to hear them, especially at the Swift open design meetings. We had a couple just recently about reducing the binary size and also building performant quantized models that might be fun to check out.

Developer tooling: you're not sacrificing any of your favorite developer products. You can import Python modules just as you would in Python itself. You can display plots inline. You can have user-defined types, and there's also support in VS Code. This is one of the Swift extensions. You can see the nice features like semantic-aware autocomplete there, and it works just as expected, which is quite nice, especially if you've been playing with some of the Python extensions.

Future directions. AD is complete and being upstreamed. Then there's mobile and embedded devices, generic arguments, C++ interop, concurrency, and ownership. All of those things are very near and dear to our hearts, and we would love to have collaboration. The project is open source, and all of our development is done on GitHub. If you want to get involved, it's super easy to do so.

Getting Started with S4TF

Swift for TensorFlow, as I mentioned, is part of the TensorFlow organization, and we have a number of Jupyter Notebooks that are available for you to test out. All of them are runnable. You just Shift+Enter to your heart's content, and all of the cells execute immediately. No need for any specialized compilation if you're operating from a Jupyter Notebook. You can stay informed by joining our mailing list. We have a page on the TensorFlow website. Then we also have GitHub, which is far and away the most active place to take a look. Swift models, Swift APIs, and Swift docs are all really interesting locations to check out.

Questions and Answers

Participant 1: Is Paige jealous of Julia?

Bailey: Julia is a programming language that has also seen considerable traction in the machine learning space. It's not as well integrated with some of the deployment scenarios as one would like. One of the things I found most interesting about Swift is that you can create one model and have it deployable to any architecture. Whereas right now, the process seems to be, you create a model in Python, you might export it to a saved model format if it's implemented in TensorFlow, which could then be converted into a TF Lite model if you want to run it on an Android device. Or, you could use Core ML to port it to a format that would be useful to an iPhone.

In the end, you end up with these very awkward parallel deployment pipelines where you might have custom logic for doing data pre-processing on-device, and that's in one language, which might be Kotlin if it's Android, or it might be Swift if it's iPhone, and then you have the logic that comprises the model. Then you have the logic that needs to happen in order to give an output back to your user. All of those things are in very different languages and very difficult to maintain over time. You might make a change to the main model, and it might not be recognized downstream. Or, you might see different performance for both of those two, like for Android devices versus iPhone devices.

Part of the magic of Swift is that you can implement the model in Swift, then compile it down and export it. Then, if you're on an iPhone, just use Swift, or if you're on a server, use Swift. Or, if you're on an Android device, you can have the user interface implemented in Kotlin, but just be calling Swift as a library. Julia I don't believe is super popular in terms of server-side support or mobile device application building or those sorts of things. That was really important to us.

Moderator: Could you elaborate a bit on the auto-differentiation, and how it's different in Swift for TensorFlow? What does the compiler do? What's the magic there?

Bailey: I'm probably going to get this horribly wrong. What the compiler is able to do is it's able to understand what you're attempting to do in your function and do a lot of the work for you as opposed to you having to build up something like a GradientTape. I'm not sure how many of you have experimented with TensorFlow 2.0 or TensorFlow in general. The way that differentiation works in that scenario is that you create a GradientTape, which essentially collects your variables as you plug and chug along and get the change over time.

Then either that's exposed to you at the very end or periodically as the processing is done. Or, in PyTorch, it's through something called LazyTensor. The way that Swift works is you don't really have to think about the concept of building a tape, and you don't really have to think about collecting your variables over time. All of that's done by the compiler for you, which makes life a great deal easier.
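To show what "building a tape" involves, here is a toy Python sketch of tape-based reverse-mode differentiation. It is a deliberately tiny illustration with hypothetical `Tape` and `Var` classes; it is not TensorFlow's GradientTape, and it says nothing about how the Swift compiler generates derivatives.

```python
class Tape:
    """Records each operation so gradients can be computed by walking backwards."""

    def __init__(self):
        self.steps = []  # each entry: (output, [(input, local_gradient), ...])


class Var:
    """A value whose arithmetic is recorded on a tape."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        self.tape.steps.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        self.tape.steps.append((out, [(self, other.value), (other, self.value)]))
        return out


def gradients(tape, output):
    """Replay the tape in reverse, accumulating d(output)/d(var) for every variable."""
    grads = {id(output): 1.0}
    for out, inputs in reversed(tape.steps):
        upstream = grads.get(id(out), 0.0)
        for var, local in inputs:
            grads[id(var)] = grads.get(id(var), 0.0) + upstream * local
    return grads


tape = Tape()
x = Var(3.0, tape)
y = Var(4.0, tape)
z = x * y + x                 # dz/dx = y + 1 = 5, dz/dy = x = 3
grads = gradients(tape, z)
dz_dx, dz_dy = grads[id(x)], grads[id(y)]
```

The bookkeeping in `Tape` and `gradients` is the part that, per the answer above, the programmer does not have to think about when the compiler handles differentiation.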




Recorded at:

Mar 09, 2020