Google Releases Quantization Aware Training for TensorFlow Model Optimization

Google announced the release of the Quantization Aware Training (QAT) API for their TensorFlow Model Optimization Toolkit. QAT simulates low-precision hardware during the neural-network training process, adding the quantization error into the overall network loss metric, which causes the training process to minimize the effects of post-training quantization.

Pulkit Bhuwalka, a Google software engineer, gave an overview of the new API at the recent TensorFlow Dev Summit. TensorFlow's mobile and IoT toolkit, TensorFlow Lite, supports post-training quantization of models, which can reduce model size by up to 4x and increase inference speed by up to 1.5x. However, quantization reduces the precision of computation, which can reduce model accuracy. By simulating inference-time quantization errors during training, QAT produces models that are "more robust to quantization." The QAT API also supports simulation of custom quantization strategies, which allows researchers to target their models at other platforms and quantization algorithms beyond those currently supported by TensorFlow Lite. Writing in a blog post, the TensorFlow Model Optimization team said,

We are very excited to see how the QAT API further enables TensorFlow users to push the boundaries of efficient execution in their TensorFlow Lite-powered products as well as how it opens the door to researching new quantization algorithms and further developing new hardware platforms with different levels of precision.
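Using the API involves wrapping an existing Keras model before training. The following is a minimal sketch based on the published tfmot Keras API; the model architecture and the commented-out training data are placeholders:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder Keras model standing in for a real architecture
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

# Wrap the model; fake-quantization ops are inserted so training
# "sees" the error that low-precision inference will introduce
qat_model = tfmot.quantization.keras.quantize_model(model)

qat_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
# qat_model.fit(train_images, train_labels, epochs=1)  # placeholder data
```

The wrapped model trains like any other Keras model; the quantization error is simply baked into the forward pass, and therefore into the loss.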

Many state-of-the-art deep-learning models are too large and too slow to be used as-is on mobile and IoT devices, which often have constraints on all resources, including power, storage, memory, and processor speed. Quantization reduces model size by storing model parameters and performing computations with 8-bit integers instead of 32-bit floating-point numbers. This improves model performance but introduces errors in computation that reduce model accuracy, and these errors accumulate with every operation needed to calculate the final answer. QAT's insight is that by simulating these errors during training, they become part of the loss metric, which is minimized by the training process; thus, the model is "pre-built" to compensate for the quantization errors.
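As a rough illustration of where the error comes from, the following sketch performs an 8-bit affine quantize/dequantize round trip; the scale and zero-point values are arbitrary assumptions chosen for the example, and the residual is the kind of error QAT folds into the loss:

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Map float values to 8-bit integers (affine quantization)
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    # Map back to floats; the rounding/clipping error remains
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([0.05, -1.2, 3.14159], dtype=np.float32)
scale, zero_point = 0.05, 0  # in practice derived from the data's min/max
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(x_hat - x)  # per-value quantization error
```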

Also, because quantizing data inputs and hidden-layer activations requires scaling these values, the quantization algorithm needs some knowledge of the data's distribution, particularly its maximum and minimum values. Post-training quantization schemes often require a calibration step to determine scaling factors, but this step can be error-prone if a good representative sample is not used. QAT improves on this process by maintaining the statistics needed to choose good scaling factors, in essence "learning" the proper quantization of the data.
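For comparison, TensorFlow Lite's post-training integer quantization gathers these ranges through a calibration step. A minimal sketch, where `model` and `calibration_batches` are hypothetical placeholders:

```python
import tensorflow as tf

def representative_dataset():
    # Yield a small, representative sample of real inputs so the
    # converter can observe activation ranges (min/max) for scaling
    for batch in calibration_batches:  # hypothetical iterable of input batches
        yield [batch]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```

If the calibration batches are not representative of production inputs, the observed ranges, and thus the scaling factors, will be off; QAT sidesteps this by learning the ranges during training.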

TensorFlow Model Optimization introduced full post-training integer quantization last summer, but quantization-aware training was only available as an unofficial "contrib" package. PyTorch, TensorFlow's major competitor in the deep-learning framework space, released its own official quantization-aware training tooling late last year.

In a Twitter discussion, TensorFlow product manager Paige Bailey suggested the use of the TensorFlow Model Optimization toolkit in response to the question:

Do we have an equivalent for the Deep Compression technique (originally proposed by Song Han) implemented for TF that could help me compress my model for deployment on desktops?

Deep Compression, proposed by MIT professor Song Han, is a suite of techniques that includes quantization as well as network pruning and data compression. The TensorFlow Model Optimization toolkit currently supports quantization and network pruning, with support for tensor compression listed as future work on the roadmap.
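As an illustration of the toolkit's pruning support, this sketch wraps a hypothetical Keras `model` so that low-magnitude weights are progressively zeroed during training:

```python
import tensorflow_model_optimization as tfmot

# Gradually raise sparsity from 0% to 50% over the first 1000 training steps
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=1000)
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

# Training a pruned model requires the UpdatePruningStep callback, e.g.:
# pruned_model.fit(..., callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```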

TensorFlow Model Optimization source code is available on GitHub.
 
