# Google Open-Sources Fast Attention Module Performer

Google has open-sourced Performer, a Transformer deep-learning architecture that scales linearly with input sequence length. This allows Performer to be used for tasks that require long sequences, including pixel prediction and protein sequence modeling.

A team from Google Research described the model and several experiments in a paper published on arXiv. The Performer uses a generalized attention mechanism called Fast Attention Via positive Orthogonal Random features (FAVOR+) to accurately estimate the standard softmax attention used in the popular Transformer model, reducing the space and time complexity from quadratic to linear. The decreased complexity allows Performers to be used in applications requiring longer sequence lengths than those supported by regular Transformers. Furthermore, the FAVOR+ attention mechanism is fully backward-compatible with existing Transformer models, an advantage over other efficient attention schemes, such as sparse attention. According to team members Krzysztof Choromanski and Lucy Colwell, writing on Google's blog,

> We believe that our research opens up a brand new way of thinking about attention, Transformer architectures, and even kernel methods.

The Transformer neural-network architecture is a common choice for sequence learning, especially in the natural-language processing (NLP) domain. It has several advantages over previous architectures, such as recurrent neural networks (RNNs); in particular, the self-attention mechanism that allows the network to "remember" previous items in the sequence can be executed in parallel on the entire sequence, which speeds up training and inference. However, since self-attention can link each item in the sequence to every other item, the computational and memory complexity of self-attention is $$O(N^2)$$, where $$N$$ is the maximum sequence length that can be processed. This puts a practical limit on sequence length of around 1,024 items, due to the memory constraints of GPUs.
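To make the quadratic cost concrete, here is a minimal sketch of standard single-head scaled dot-product attention in plain Python. It is illustrative only and is not taken from the Performer code base; the function names and the 4-bytes-per-entry figure (32-bit floats) are assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_attention(queries, keys, values):
    """Standard scaled dot-product attention for one head.

    queries, keys: N lists of d floats; values: N lists of d_v floats.
    Returns (outputs, attention), where attention is the full N x N matrix.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    # N x N matrix: one score per (query, key) pair, hence O(N^2) time and memory.
    attention = [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]
    # Each output row is an attention-weighted average of the value rows.
    outputs = [
        [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        for weights in attention
    ]
    return outputs, attention

def attention_matrix_bytes(n, bytes_per_entry=4):
    """Memory for one N x N attention matrix (assumes 32-bit floats by default)."""
    return n * n * bytes_per_entry
```

Under that 32-bit assumption, a single attention matrix takes 4 MiB at a sequence length of 1,024 and 256 MiB at 8,192, and a real model holds one such matrix per attention head per layer.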

The original Transformer attention mechanism computes an attention matrix of size $$N \times N$$ by scoring the similarity of every query against every key and applying a softmax operation; the rows and columns correspond to queries and keys, respectively. The attention matrix is then multiplied by the values derived from the input sequence to produce the output. Performer's FAVOR+ algorithm decomposes the matrix into two matrices which contain "random features": random non-linear functions of the queries and keys. The research team showed that this decomposition can approximate the original attention result within any desired precision, while reducing the compute and storage complexity to $$O(N)$$. Furthermore, the algorithm allows for other similarity operations besides softmax, producing a more generalized definition of attention.
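The random-feature idea can be sketched in a few dozen lines of plain Python. This is a simplified illustration, not Google's implementation: it uses independent Gaussian projections and omits the orthogonalization step that FAVOR+ applies to reduce variance, and the function names, `num_features`, and `seed` are hypothetical choices for this example. It relies on the identity that for Gaussian $$w$$, the expected value of $$\exp(w \cdot x - |x|^2/2)\exp(w \cdot y - |y|^2/2)$$ equals $$\exp(x \cdot y)$$, so positive features of queries and keys can stand in for the softmax kernel.

```python
import math
import random

def positive_random_features(x, projections):
    """Map vector x to m positive features whose dot products estimate exp(x . y)."""
    half_sq_norm = 0.5 * sum(xi * xi for xi in x)
    m = len(projections)
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq_norm) / math.sqrt(m)
        for w in projections
    ]

def linear_attention(queries, keys, values, num_features=256, seed=0):
    """Approximate softmax attention without forming the N x N matrix."""
    rng = random.Random(seed)
    d, d_v = len(queries[0]), len(values[0])
    # Assumption: plain i.i.d. Gaussian projections (FAVOR+ also orthogonalizes them).
    projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    # Fold the usual 1/sqrt(d) scaling into queries and keys (d^(-1/4) each).
    s = d ** -0.25
    q_feats = [positive_random_features([s * x for x in q], projections) for q in queries]
    k_feats = [positive_random_features([s * x for x in k], projections) for k in keys]

    # One pass over the keys: kv[f][c] = sum_j k_feats[j][f] * values[j][c].
    kv = [[0.0] * d_v for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for kf, v in zip(k_feats, values):
        for f, weight in enumerate(kf):
            k_sum[f] += weight
            for c in range(d_v):
                kv[f][c] += weight * v[c]

    # One pass over the queries: numerator divided by normalizer, O(m * d_v) each.
    outputs = []
    for qf in q_feats:
        normalizer = sum(a * b for a, b in zip(qf, k_sum))
        outputs.append([
            sum(qf[f] * kv[f][c] for f in range(num_features)) / normalizer
            for c in range(d_v)
        ])
    return outputs

def exact_attention(queries, keys, values):
    """Reference softmax attention (quadratic), used only for comparison."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = [math.exp(scale * sum(a * b for a, b in zip(q, k))) for k in keys]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs
```

Because the keys and values are summarized once into `kv` and `k_sum`, the work grows with N times the number of features rather than with N squared. Raising `num_features` tightens the approximation at the cost of more computation per item.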

To demonstrate the utility of training on longer sequences, the team used Performer to develop a protein-sequence "language model." In this model, protein "words" are represented as linear sequences of amino-acid "characters." Models trained on these sequences can be used to predict geometric information about the resulting protein molecule. The longer sequences supported by the Performer allowed the researchers to concatenate several sequences together to predict the interactions among the proteins. These longer sequences, up to 8,192 amino acids, overload the memory of large-scale regular Transformers. Smaller Transformers can be trained on the data, but achieve only around 19% accuracy, compared to Performer's 24%.

Several other schemes for reducing attention complexity have been developed recently. For example, last year OpenAI developed a sparse factorization of the attention matrices that reduces the network complexity from $$O(N^2)$$ to $$O(N\sqrt{N})$$. Google Research recently introduced the Reformer, which uses approximate attention calculation via locality-sensitive hashing (LSH), reducing the memory requirements to $$O(N\log N)$$. Google also developed BigBird, which uses a combination of three smaller attention mechanisms. BigBird, like Performer, has linear complexity, but according to Performer's creators, BigBird's "stacked" attention sub-layers make it difficult to use with existing pre-trained models, requiring re-training and "significant energy consumption." Additionally, sparse methods often require special sparse-matrix multiplication operations, which may not be available on all hardware platforms.
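As a rough way to compare these complexity classes, the short script below computes how much each leading-order cost term grows when the sequence length goes from 1,024 to 8,192 items. It ignores constant factors and hardware effects entirely, so it says nothing about actual running time or memory use of any of these systems; the labels and the `growth_factor` helper are illustrative names only.

```python
import math

# Leading-order cost terms as functions of sequence length N (constants ignored).
COST_TERMS = {
    "full attention, O(N^2)": lambda n: n * n,
    "sparse factorization, O(N sqrt N)": lambda n: n * math.sqrt(n),
    "LSH attention, O(N log N)": lambda n: n * math.log2(n),
    "linear attention, O(N)": lambda n: float(n),
}

def growth_factor(cost, short_n, long_n):
    """How many times larger the cost term is at long_n than at short_n."""
    return cost(long_n) / cost(short_n)

if __name__ == "__main__":
    for name, cost in COST_TERMS.items():
        print(f"{name}: x{growth_factor(cost, 1024, 8192):.1f}")
```

For this eightfold increase in length, the quadratic term grows 64 times, the $$N\sqrt{N}$$ term about 22.6 times, the $$N\log N$$ term 10.4 times, and the linear term 8 times.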

The code for the Performer's fast attention module and for the protein language model is available on GitHub.
