Google Open-Sources New Higher Performance TensorFlow Runtime

Google open-sourced the TensorFlow Runtime (TFRT), a new abstraction layer for their TensorFlow deep-learning framework that allows models to achieve better inference performance across different hardware platforms. Compared to the previous runtime, TFRT improves average inference latency by 28%.

Eric Johnson, TFRT product manager, and Mingsheng Hong, TFRT tech lead, gave an overview of TFRT in a blog post. TFRT's role is to execute kernels: primitive operations that are written for specific hardware. Eager execution invokes TFRT kernels directly, while graph execution requires the graph to be "lowered" to an intermediate representation before invoking the kernels. Other improvements include more support for concurrent execution and improvements to the code's extensibility. According to Johnson and Hong,

A high-performance low-level runtime is a key to enable the trends of today and empower the innovations of tomorrow.

TensorFlow was originally conceived as a language for building a computational graph that represented a chain of transformations on large data arrays. The graph would be built using Python instructions but executed lazily, later, during a session. This separation makes it easier to map the computation onto underlying parallel hardware, but it does have some drawbacks; for example, errors in the graph may not become apparent right away. In late 2017, inspired by rival framework PyTorch, Google added eager execution to TensorFlow, enabling faster debugging and more dynamic models that incorporate control flow. However, the existing runtime is still optimized for graph execution, and the overhead of executing a single eager op is "non-trivial." Johnson and Hong commented:

Whereas the existing TensorFlow runtime was initially built for graph execution and training workloads, the new runtime will make eager execution and inference first-class citizens.
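The difference between deferred graph execution and eager execution described above can be illustrated with a toy sketch. The names below (Node, constant, divide, run, eager_divide) are hypothetical and are not TensorFlow APIs; the sketch only shows why an error in a lazily executed graph appears later than the line that introduced it.

```python
# Toy illustration only: these names are hypothetical, not TensorFlow APIs.

class Node:
    """A deferred operation: records what to compute without computing it."""

    def __init__(self, op, inputs):
        self.op = op
        self.inputs = inputs


def constant(value):
    return Node("const", [value])


def divide(a, b):
    return Node("div", [a, b])


def run(node):
    """Plays the role of a session: walks the graph and evaluates it."""
    if node.op == "const":
        return node.inputs[0]
    if node.op == "div":
        left, right = (run(n) for n in node.inputs)
        return left / right
    raise ValueError(f"unknown op: {node.op}")


# Graph style: building the graph succeeds even though it divides by zero.
graph = divide(constant(1.0), constant(0.0))

try:
    run(graph)
    graph_error = None
except ZeroDivisionError as exc:
    # The failure surfaces only here, at run time, not where the graph was built.
    graph_error = exc


# Eager style: each operation executes immediately, so a bad call would
# fail on the exact line that makes it.
def eager_divide(a, b):
    return a / b


eager_result = eager_divide(6.0, 3.0)
```

In the graph style, the faulty division is recorded without complaint and fails only inside run; in the eager style, the same mistake would raise on the line that performs the division.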

Graph execution is still supported. TFRT improves graph execution performance by using a custom compiler framework, Multi-Level Intermediate Representation (MLIR), to convert the graph to a platform-specific program to be executed by the runtime. Google added support for MLIR to TensorFlow in 2019, and the new runtime is "tightly-integrated" with it. In addition to compiling graphs, MLIR also supports arbitrary C++ types, which increases the runtime's extensibility.
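As a rough sketch of what "lowering" a graph means, the following toy example flattens a small dependency graph into an ordered list of kernel invocations and then runs it. It is an assumption-laden illustration, not MLIR and not the TFRT API: KERNELS, GRAPH, lower, and execute are all hypothetical names, and real lowering also involves optimization and device-specific code generation that are omitted here.

```python
# Hypothetical sketch: this is neither MLIR nor the TFRT API.

# A "kernel" here is just a Python callable keyed by operation name.
KERNELS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

# A tiny graph: node name -> (operation, input names).
# Names that are not keys of the graph ("x", "y") are external inputs.
GRAPH = {
    "sum": ("add", ["x", "y"]),
    "out": ("mul", ["sum", "x"]),
}


def lower(graph, output):
    """Flatten the nodes needed for `output` into dependency order."""
    program, seen = [], set()

    def visit(name):
        if name in seen or name not in graph:
            return  # already emitted, or an external input
        seen.add(name)
        op, inputs = graph[name]
        for dep in inputs:
            visit(dep)
        program.append((name, op, inputs))

    visit(output)
    return program


def execute(program, feeds):
    """Run the flat program by invoking one kernel per instruction."""
    values = dict(feeds)
    for name, op, inputs in program:
        values[name] = KERNELS[op](*(values[i] for i in inputs))
    return values


program = lower(GRAPH, "out")
result = execute(program, {"x": 2, "y": 3})["out"]
```

The runtime's job in this picture is the execute step: once the graph has been turned into a flat, ordered program, all that remains is to invoke the right kernel for each instruction.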

TFRT Architecture - source: https://github.com/tensorflow/runtime

TFRT is one of many recent improvements made to the TensorFlow framework to increase inference performance, including TensorFlow Lite and the Model Optimization Toolkit. TensorFlow Lite also converts models to a form targeted for specific hardware but focuses on resource-constrained processors such as mobile and edge devices. TFRT, by contrast, aims to improve model inference across all platforms, including the cloud or datacenter, and includes targets such as GPUs and high-end CPUs. To measure the improvements to inference latency, Google integrated TFRT with TensorFlow Serving, a production-grade serving environment for model inference. For their experiment, they chose a ResNet-50 model and executed it on both TFRT and the previous runtime. Average inference time on TFRT improved by 28% compared to the previous runtime.

In a discussion on Twitter, product manager Johnson noted that support for NVIDIA Tensor Cores was on the roadmap. Another user said,

TFRT is to TensorFlow what the JVM is to Java. Look forward to other frameworks being able to run on TFRT.

TFRT and TensorFlow framework source code are available on GitHub.
