Earlier this month, Google went into more detail about the Tensor Processing Unit (TPU) it announced a year ago. Norm Jouppi, senior architect of the TPU project, and his team report an order-of-magnitude performance gain in neural network execution using the TPU over some of the leading high-performance processors, specifically the Nvidia K80 GPU and the Haswell E5-2699 v3 CPU. Jouppi and team noted:
"The TPU is about 15X - 30X faster at inference than the K80 GPU and the Haswell CPU... Four of the six NN apps (tested) are memory-bandwidth limited on the TPU; if the TPU were revised to have the same memory system as the K80 GPU, it would be about 30X - 50X faster than the GPU and CPU...[and] the TPU outperforms standard processors by 30 to 80 times in the TOPS/Watt measure, a metric of efficiency."
Early motivation for the custom ASIC came from the projected compute requirements of serving Google's translation API: the team noted that if every phone on the planet used the translation API for just three minutes a day, Google would need dozens of additional data centers.
The architecture paper captures the experimental design, data collection, and analysis run against the K80 and E5-2699 as execution engines for a range of neural networks. The TPU is not currently used for training neural networks. Its primitive operation is the matrix multiply, implemented with a matrix-multiply unit plus an on-chip memory and caching system that lets it process multiple layers of a neural network, including retaining layer outputs and state between layers, behavior used heavily by MLPs and CNNs.
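To see why a matrix multiply is a natural primitive, the following minimal NumPy sketch (not TPU code; the layer sizes and batch are made up for illustration) shows how an MLP forward pass reduces to a chain of matrix multiplies, with each layer's output held and fed to the next layer:

```python
# Illustrative only: an MLP forward pass reduces to a chain of matrix
# multiplies, where each layer's output becomes the next layer's input.
# The topology and batch shapes are hypothetical.
import numpy as np

def mlp_forward(x, weights, biases):
    """Run a batch of inputs through a stack of fully connected layers."""
    activation = x
    for W, b in zip(weights, biases):
        # The core primitive: one matrix multiply per layer.
        activation = np.maximum(activation @ W + b, 0.0)  # ReLU non-linearity
    return activation

rng = np.random.default_rng(0)
layer_sizes = [256, 512, 512, 10]                     # hypothetical topology
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]
batch = rng.standard_normal((32, layer_sizes[0]))     # hypothetical input batch
print(mlp_forward(batch, weights, biases).shape)      # (32, 10)
```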
The TPU isn't bound to a single neural network implementation, though; it is meant to be generic, with an architecture shaped by the broad range of use cases Jouppi and team studied. Part of the motivation for this was the timeline for delivering the TPU, but also the flexibility to optimize around matrix operations as the most primitive operation the chip performs. Integrating the TPU into a host CPU/GPU system, which houses the rest of the application, was achieved over a PCIe bus.
The architecture requires CPUs or GPUs to handle training and any part of the TensorFlow application beyond executing the neural network itself. For example, if an application has to pre-load data or apply logic to feed work into TensorFlow execution, that would have to be managed by the CPU/GPU and then sent to the TPU. In this sense, the TPU plays a role much like a graphics card or an FPU.
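As a rough illustration of that division of labor, the sketch below keeps data preparation on the host and confines the accelerator to running an already-trained network; `tpu_infer` is a hypothetical stand-in for whatever driver or framework call actually moves tensors over PCIe, not a real API.

```python
# Sketch of the host/accelerator split described above. The accelerator
# call is a stand-in: real deployments go through a driver or framework
# API that transfers tensors over PCIe.
import numpy as np

def preprocess(raw_batch):
    # Host-side work: decoding, normalization, batching, control logic.
    return np.asarray(raw_batch, dtype=np.float32) / 255.0

def tpu_infer(batch, model_params):
    # Hypothetical stand-in for the PCIe round trip to the accelerator,
    # which only executes the already-trained network.
    W, b = model_params
    return batch @ W + b

model_params = (np.zeros((784, 10), dtype=np.float32),
                np.zeros(10, dtype=np.float32))       # pretend trained weights
raw = np.random.randint(0, 256, size=(8, 784))        # pretend input images
logits = tpu_infer(preprocess(raw), model_params)     # host feeds, device executes
print(logits.shape)                                   # (8, 10)
```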
"The TPU is programmable like a CPU or GPU. It isn’t designed for just one neural network model; it executes CISC instructions on many networks (convolutional, LSTM models, and large, fully connected models). So it is still programmable, but uses a matrix as a primitive instead of a vector or scalar."
Given the deterministic execution of the TPU, compared with the time-varying optimizations of CPU and GPU architectures, the TPU outperformed the benchmark chips on TOPS/Watt, the ratio of tera-operations per second to watts of power consumed. As noted above, the TPU reportedly outperforms standard processors by 30 to 80 times on this measure.
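As a purely illustrative reading of that metric, the sketch below computes TOPS/Watt from a hypothetical throughput and power draw for an accelerator and a baseline chip; none of these numbers are taken from the paper.

```python
# Illustrative arithmetic for the TOPS/Watt metric; the throughput and
# power figures are placeholders, not measurements from the paper.
def tops_per_watt(ops_per_second, watts):
    return ops_per_second / 1e12 / watts

accelerator = tops_per_watt(ops_per_second=90e12, watts=75)   # hypothetical ASIC
baseline    = tops_per_watt(ops_per_second=6e12,  watts=300)  # hypothetical CPU/GPU
print(f"{accelerator / baseline:.0f}x better TOPS/Watt")      # -> 60x
```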
"The TPU leverages the order-of-magnitude reduction in energy and area of 8-bit integer systolic matrix multipliers over 32-bit floating-point datapaths of a K80 GPU to pack 25 times as many MACs (65,536 8-bit vs. 2,496 32-bit) and 3.5 times the on-chip memory (28 MiB vs. 8 MiB) while using less than half the power of the K80 in a relatively small die. This larger memory helps increase the operational intensity of applications to let them utilize the abundant MACs even more fully... Order-of-magnitude differences between commercial products are rare in computer architecture, which may lead to the TPU becoming an archetype for domain-specific architectures."
As part of the research phase of their experimental design, Jouppi and team surveyed neural network usage across Google's platform and found greater demand for latency-sensitive applications, relative to throughput-sensitive ones, than they had initially expected, leading them to the realization that low utilization of a huge, cheap resource can still deliver high, cost-effective performance, echoing Amdahl's Law.
The TPU experiment profile consisted of six neural networks: two MLPs, two CNNs, and two LSTMs. The two MLPs and two LSTMs are memory-bound, so adjusting memory bandwidth across permutations of the experiment had the most pronounced effect on their performance. Much of this comes down to weight reuse: the CNNs reuse each weight across many positions of their input, giving them a high ratio of compute to memory traffic and a compute-bound performance ceiling, while the MLPs and LSTMs get far less reuse out of each weight they fetch, leaving them limited by memory bandwidth. This still held true when factoring in the I/O bandwidth of the PCIe bus used by the TPU.
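A back-of-the-envelope comparison (with illustrative layer shapes, not the paper's benchmark configurations) makes the weight-reuse point concrete: the sketch below counts multiply-accumulate operations per byte of weights fetched for a fully connected layer versus a convolutional layer.

```python
# Rough operations-per-weight-byte comparison for one fully connected layer
# versus one convolutional layer; the layer shapes are illustrative only.

def fc_intensity(in_features, out_features, bytes_per_weight=1):
    macs = in_features * out_features              # each weight used once per input
    weight_bytes = in_features * out_features * bytes_per_weight
    return macs / weight_bytes

def conv_intensity(h, w, in_ch, out_ch, k=3, bytes_per_weight=1):
    macs = h * w * in_ch * out_ch * k * k          # each weight reused at every output position
    weight_bytes = in_ch * out_ch * k * k * bytes_per_weight
    return macs / weight_bytes

print(f"FC layer:   {fc_intensity(4096, 4096):.0f} MAC(s) per weight byte")
print(f"Conv layer: {conv_intensity(56, 56, 64, 64):.0f} MAC(s) per weight byte")
```

Without batching, the fully connected layer performs roughly one multiply-accumulate per weight byte fetched, while the convolution reuses each weight thousands of times, which is consistent with the CNNs hitting a compute-bound ceiling and the MLPs and LSTMs a memory-bandwidth one.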
With the memory and caching improvements described for a hypothetical TPU-prime architecture, one that would have required more than the 15-month schedule in which the current TPU was delivered, Jouppi and team estimate they would achieve a thirty to fifty times performance improvement over the K80 and E5-2699.